Abstract

In recent years, the large volume of continuous, heterogeneous data generated by Internet of Things (IoT) sensors and devices has made storing and querying these data much more difficult. Most state-of-the-art methods fail to meet the new IoT requirements. In this thesis, kNN search combined with parallelism is used for similarity query search in the proposed methods, which are developed in metric space within the Fog-Cloud architecture. The first proposition is the Binary tree based on Containers at the Cloud-Clusters Fog computing level (B3CF-tree), an index constructed by combining DBSCAN clustering and parallelism. The simulation results for index construction and parallel kNN query search showed that the B3CF-tree surpassed the indexes in the literature. The second proposition is the Coefficient of Variation (CV) method, developed for indexing continuous IoT data streams. In this method, the first data stream is grouped into clusters using the DBSCAN algorithm, and the data in these clusters are indexed directly in parallel. After the arriving data stream is clustered, the data in its clusters are either inserted into existing indexes or used to construct new indexes, depending on the coefficient of variation value. This method proved efficient in terms of index construction and parallel kNN query search compared with two other methods representing the two extreme cases, namely the Creation of a New Index (CNI) and the Insertion in an Existing Index (IEI) methods. The third proposition is the Threshold Distance (TD) method, which resembles the CV method. However, in the TD method, the arriving clusters are indexed separately or inserted into existing indexes based on a comparison of the distance between their centers and the centers of the first clusters with a threshold distance TD. This method outperforms the Creation of a New Tree (CNT) method in terms of tree construction and parallel kNN search; however, it falls somewhat short of the CV method. The experimental results showed that both methods surpassed some indexing methods in the literature and could be considered alternative methods for indexing continuous big IoT data. The last proposition is the Quad tree based on Containers at the Cloud-Fog computing level (QCCF-tree), in which data are indexed directly without clustering. Comparing the experimental results for index construction and parallel kNN search in the index nodes with some indexes in the literature showed that the QCCF-tree is more efficient than these indexes. This makes it a candidate alternative method for indexing big IoT data, even though it is weaker than the B3CF-tree.
Résumé

Recently, the large amount of continuous and heterogeneous data generated by IoT sensors and components has made data recording and query search very difficult tasks. Most state-of-the-art methods have failed to handle the requirements of the IoT. In this thesis, the kNN search method combined with parallelism is used for similarity query search in proposed structures developed in metric space within the Fog-Cloud architecture. The first proposition is the Binary tree based on Containers at the Cloud-Clusters Fog computing level (B3CF-tree), an index built by combining clustering with the DBSCAN algorithm and parallelism. The simulation results for tree construction and parallel kNN query search showed that the B3CF-tree surpassed the other indexes in the literature, which makes it a strong alternative for indexing big IoT data. The second proposition is the Coefficient of Variation (CV) method, developed to index continuous data. In this method, the first data stream is grouped into clusters using the DBSCAN algorithm. The data in these clusters are indexed directly in parallel. After the arriving stream is clustered, the data in its clusters are inserted into existing indexes, or new indexes are built, according to the value of the coefficient of variation. This method proved its efficiency, in terms of index construction and parallel kNN query search, when compared with two methods representing the two extreme cases, namely the Creation of a New Index (CNI) method and the Insertion in an Existing Index (IEI) method. The third proposition is the Threshold Distance (TD) method, which resembles the CV method. However, in the TD method, the arriving clusters are indexed or inserted into existing indexes based on a comparison of the distance between their centers and those of the first clusters with a threshold distance TD. This method surpassed the Creation of a New Tree (CNT) method in terms of tree construction and parallel kNN search; however, it is somewhat less efficient than the CV method. The experimental results showed that these two methods surpassed some indexing methods in the literature and can be considered alternative methods for indexing continuous IoT data. The last proposition is the Quad tree based on Containers at the Cloud-Fog computing level (QCCF-tree), in which data are indexed directly without clustering by the DBSCAN algorithm. Comparing the experimental results for tree construction and parallel kNN search in the tree nodes with some indexes in the literature showed that the QCCF-tree is more efficient than these indexes. This makes it a candidate alternative method for indexing IoT data, even though it is weaker than the B3CF-tree.
Introduction

Motivations

In the last decades, the Internet of Things (IoT) has found a wide range of uses, such as in smart cities, smart homes and health care. This technology allows a large number of physical objects and devices with identities, personalities and network capabilities (Things) to communicate and interact transparently with one another and with other network resources (Internet). These IoT devices provide services that make everyday life easier. They are heterogeneous and, in many cases, deployed in distributed and dynamic environments over a large geographic region. They generate huge amounts of data that can overwhelm storage systems and cause a serious increase in retrieval time. A forecast from the International Data Corporation (IDC) estimates that there will be 41.6 billion connected IoT devices, or "things", generating 79.4 zettabytes (ZB) of data in 2025 [1]. Data latency is considered a serious obstacle when using cloud computing to store and process this big IoT data. The causes of this data delay are still obvious [2]. Much research has addressed big IoT data storage, and various papers have been published [3], [4] to improve cloud-based data storage and query retrieval algorithms. Recently, a few studies have addressed data storage and retrieval using fog computing [5], [6] because of its interesting characteristics, such as closeness to the end users and computation capabilities. In addition, in fog computing, big IoT data can be distributed over many fogs located in different geographic regions. Exploiting these fog characteristics when indexing big IoT data will considerably improve similarity query search. Indeed, the indexing of large-scale IoT data must be efficient, dynamic and support different data types.


Context of the Study 


The massive data generated by interconnected IoT devices require storage, processing and analysis, as well as effective methods for similarity query search. Indexing methods are used to store these big IoT data. Indexing is one of the most widely used mechanisms for providing rapid access to data: it is a data organization step that must allow efficient access to the data when performing similarity queries. The principle is to organize similar data together to speed up searches [7]. The goal of any index is therefore to provide fast access to the objects in a database by reducing the search space, the cost of input/output and the number of distance calculations between objects. In other words, the index provides an efficient implementation of associative search [8]. For better management of big IoT data, fog computing architectures are currently used. The hierarchical fog architecture consists of three layers: the terminal layer, the fog layer and the cloud layer [9]. The indexing of large-scale IoT data must be efficient, dynamic and support different data types. Metric spaces have become popular in the indexing process because they make it possible to exploit not the data representation itself, which has become too rich and complex, but "only" the similarities that can be computed between objects.

Furthermore, big IoT data is dynamic, and its underlying data distribution can change over time. In addition, the data is produced in real time. This necessitates the development of IoT-specific data analytics solutions that can handle the heterogeneity, dynamicity and velocity of the data. To group the data coming from the devices, clustering methods are used, in which the data is usually clustered according to different criteria, e.g. similarity and homogeneity. The clustering results in a data analysis scenario can be interpreted as categories in a dataset and can be used to assign data to various groups, i.e. clusters [10]. Grouping IoT data into clusters may allow parallelism to be introduced during both index construction and similarity query search.


Objectives and Contributions 


IoT systems comprise various devices that continuously generate heterogeneous IoT data. This continuity poses a big challenge for data indexing and query search in the dynamic IoT environment. Traditional indexing methods have become inadequate for indexing big IoT data because they degrade at large scale and cannot grow with the permanent collection of data. In addition, the direct use of the cloud infrastructure negatively affects communication time because of the large physical distances between the data sources and the data warehouse. The aim of this thesis is to propose new systems for indexing and retrieving data in an IoT environment that deal with index degradation and network congestion while ensuring minimal search time with optimal result quality. To reach this objective, the following tasks were addressed:

Proposition of a novel taxonomy. This taxonomy is based on grouping the indexes of different types of data into centralized methods and distributed methods.

Relocation of the indexing process from the cloud to the fog nodes. The aim is to bring the data as close as possible to the indexing structure in order to considerably reduce network congestion. In addition, each fog node generates the indexing structure of its share of the distributed IoT data, which allows parallelism not only during the construction of trees but also during the query search process, through the simultaneous launch of the same query on all fog nodes.

Division of the fog layer into levels. In the cloud-fog architecture, the fog layer is divided into several levels in order to obtain a multi-step indexing process.

Use of metric space. Indexes constructed in multidimensional space depend on a specific data type and number of dimensions. In metric space, data are easier to index because indexing depends only on the distance between objects, whatever their types.

Data clustering as a first step of the indexing process. The DBSCAN algorithm is used, in the first fog level, to group IoT data into homogeneous clusters in order to reduce data overlap and index degradation.

Parallel construction of trees. This process takes place in the second fog level. In this level, the B3CF-trees (Binary tree based on Containers at the Cloud-Clusters Fog computing level) of the clusters produced by the DBSCAN algorithm in the clustering fog level are constructed simultaneously.

Use of hyperplanes for space partitioning. To index IoT data in clusters, B3CF-trees rely on partitioning the metric space with hyperplanes using two pivots, in order to guarantee non-overlapping indexes.

Indexing continuous data streams using the Coefficient of Variation (CV) method. To index continuous IoT data streams in BH-trees (Binary trees with Hyper-plane), the Coefficient of Variation (CV) method is used in the cluster processing fog level, located between the clustering level and the indexing level.

Indexing continuous data streams using the Threshold Distance (TD) method. In the Threshold Distance (TD) method, proposed for indexing continuous IoT data streams in GHTs (Generalised Hyper-plane Trees), the fog layer is divided only into a clustering level and an indexing level.

Use of balls for space partitioning. A proposed index called the QCCF-tree (Quad-tree based on Containers at the Cloud-Fog computing level) is constructed, in the fog node, by partitioning the metric space into four balls using four pivots in order to reduce index degradation and speed up query search. In this approach, the fog layer is not divided.

Use of parallel kNN search in the proposed binary trees. The use of the DBSCAN algorithm for data clustering allows not only parallel construction but also parallel query search. The kNN search method is combined with parallelism in order to improve the similarity query search process.

Use of parallel kNN search in the QCCF-tree nodes. Parallelism is combined with the kNN search method inside the QCCF-tree, i.e. in the QCCF-tree nodes, in order to speed up similarity query search.
Overview of the Thesis

This thesis is presented in two parts, in addition to the introduction and the conclusion. The first part, entitled IoT data indexing in metric spaces: definitions and related work, contains four chapters. The first chapter deals with the mathematical definitions of some concepts and the different methods of similarity query search in metric space. The second chapter presents the definition of the Internet of Things, its characteristics and its challenges, together with an overview of cloud computing and fog computing. In the third chapter, big IoT data is defined and clustering methods are described in detail, in addition to other data analytics methods. The fourth chapter gathers the state of the art on centralized and distributed indexing methods in multidimensional and metric spaces, focusing on their advantages and limitations. The approaches proposed in metric space within the cloud-fog architecture are gathered in the second part, which also contains four chapters. The first chapter presents parallel kNN search in the proposed B3CF-trees, constructed using DBSCAN clustering combined with parallel indexing. In the second chapter, parallel kNN search is used for similarity search in BH-trees constructed for indexing continuous data streams using the Coefficient of Variation (CV) method. The third chapter presents parallel kNN query search in GH-trees constructed for indexing continuous IoT data streams using the Threshold Distance (TD) method. The last chapter presents parallel kNN query search in the nodes of the QCCF-tree, constructed by indexing the whole IoT data in a quad tree. For each proposition, a detailed description, the algorithms, a description of the computation platform, the datasets used and the computation parameters are provided. The experimental results in terms of index construction and the kNN query search process are presented, discussed and compared with those in the literature for each proposition. A comparison between the proposed approaches is also provided.


Part I 


IoT Data Indexing in Metric Space:
Definitions and Related Work


1 Metric Spaces 


1.1 Introduction 


The widespread use of smart objects connected to the Internet, such as sensors, actuators and embedded devices, has led to an increase in the amount of collected data [11]. These data are heterogeneous and dynamic, and their underlying distribution can change over time [10]. Moreover, the data arrive in large quantities and are produced in real time. They need to be processed and stored in a manner that allows quick retrieval. Many approaches developed in multidimensional space have shown disadvantages when storing data that is heterogeneous in terms of size, type and dimension [12], [13], [14]. In this work, metric space is proposed as the right compromise since, in this space, only the distances between data are used, regardless of their types and dimensions [15]. Metric space has previously been proposed as a universal abstraction for data [16]. Furthermore, multidimensional spaces are special cases of metric spaces. In a vector space, objects are represented by vectors, and the geometric properties of these vectors can be used for search. These properties, of course, cannot be extended to general metric spaces [17].

In this chapter, we present the mathematical definition of a metric space and the ball and hyperplane data partitioning concepts in metric space. Lastly, we provide definitions of some similarity query search methods in metric space.


1.2 Metric Space Definition 


A metric space is a set of objects on which a notion of distance between objects is defined. The mathematical definition is given as follows:

Definition 1.2.1 (Metric space). A metric space $M = (O, d)$ is defined by a dataset $O$ (a set of points) and a distance function $d : O \times O \to \mathbb{R}^+$. The distance function $d$ measures the similarity between two elements of $O$; similar objects correspond to smaller distances. The distance function $d$ is characterized by:

1. Non-negativity: $\forall (x, y) \in O^2,\ d(x, y) \geq 0$. (1.1)
2. Reflexivity: $\forall x \in O,\ d(x, x) = 0$. (1.2)
3. Symmetry: $\forall (x, y) \in O^2,\ d(x, y) = d(y, x)$. (1.3)
4. Triangle inequality: $\forall (x, y, z) \in O^3,\ d(x, y) + d(y, z) \geq d(x, z)$. (1.4)
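The four properties can be checked numerically on a finite sample of objects. The following is a minimal sketch, not part of the thesis: the helper name `is_metric`, the tolerance and the sample points are illustrative assumptions, and the triangle inequality is tested in its standard form d(x, z) <= d(x, y) + d(y, z). Passing the check on a sample does not prove that a function is a metric; failing it proves that it is not.

```python
from itertools import product


def is_metric(points, d, tol=1e-12):
    """Check the four metric axioms for d on a finite sample (hypothetical helper)."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                          # non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:         # symmetry
            return False
    for x in points:
        if abs(d(x, x)) > tol:                   # reflexivity
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:    # triangle inequality
            return False
    return True


def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def squared_euclidean(a, b):
    # Not a metric: it violates the triangle inequality.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
euclidean_ok = is_metric(sample, euclidean)
squared_ok = is_metric(sample, squared_euclidean)
```

On this sample the Euclidean distance passes, while the squared Euclidean distance fails on the collinear points (0, 0), (1, 0), (2, 0), since 4 > 1 + 1.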


1.3 Multidimensional Space Definition

A multidimensional space is defined as a set of objects, called vectors, which may be homogeneous or heterogeneous. The most usual case is that of orthonormal spaces, i.e. spaces defined on $\mathbb{R}^n$. In a vector space, objects are represented by vectors, and the geometrical properties of these vectors are exploited for search. However, these properties cannot be extended to general metric spaces [17]. Multidimensional spaces are special cases of metric spaces: any norm on a multidimensional space induces a distance function, and therefore a metric space [18].


1.4 Distance Functions 


Distance functions are tools for measuring the proximity between objects, and each is suited to specific applications. The metrics presented here are based on coordinates and can be divided into two groups: discrete distance functions and continuous distance functions. Discrete distance functions return only a small set of values, whereas for continuous distance functions the cardinality of the set of possible values is very large or infinite.

Furthermore, distance functions may be classified by their computational cost, simply by taking into account their approximate complexity.


1.4.1 Minkowski distances 


The Minkowski distances are a family of metric functions, known as $L_p$ metrics because the individual cases depend on the numerical parameter $p$, which is chosen according to the type of data. The function is defined on n-dimensional vectors of real numbers, or on data that can be transformed into vectors of numbers. The mathematical definition is given as follows:

Definition 1.4.1 (Minkowski distances). Let $X = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$ and $Y = (y_1, y_2, \dots, y_n) \in \mathbb{R}^n$ be two vectors, and let $p \geq 1$ be an integer. The Minkowski distance is defined as

$$L_p[X, Y] = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \qquad (1.5)$$

In the Minkowski distances, the most used values of the parameter $p$ are $p = 1$, $p = 2$ and $p = \infty$ (Figure 1.1). Each curve represents the set of points in the plane at the same distance from the central point. The metric with $p = 1$ gives a lozenge, $p = 2$ gives a circle and $p = \infty$ gives a square. The intermediate values produce a progressive bulge from the lozenge to the square via the circle [18].

1. For $p = 1$: usually called the Manhattan distance, which is expressed by the following equation:

$$L_1[X, Y] = \sum_{i=1}^{n} |x_i - y_i| \qquad (1.6)$$

2. For $p = 2$: the Euclidean distance:

$$L_2[X, Y] = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1.7)$$

3. For $p = \infty$: known as the Chebyshev distance, the maximum distance or the infinity distance. Its equation is:

$$L_\infty[X, Y] = \max_{1 \leq i \leq n} |x_i - y_i| \qquad (1.8)$$

Figure 1.1: Different $L_p$ distance functions [15].
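The three usual cases of the Minkowski family can be computed with one short function. This is a minimal sketch under stated assumptions: the function name `minkowski` and the sample vectors are illustrative and do not come from the thesis, and the case p = infinity is handled separately as the maximum coordinate difference.

```python
import math


def minkowski(x, y, p):
    """L_p distance between two equal-length vectors; p may be math.inf."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)                       # Chebyshev distance
    return sum(diff ** p for diff in diffs) ** (1.0 / p)


x = (0.0, 0.0)
y = (3.0, 4.0)
l1 = minkowski(x, y, 1)           # Manhattan: 3 + 4
l2 = minkowski(x, y, 2)           # Euclidean: sqrt(9 + 16)
linf = minkowski(x, y, math.inf)  # Chebyshev: max(3, 4)
```

For these two points the values are 7, 5 and 4: the distance shrinks as p grows, which matches the nesting of the lozenge, circle and square for a fixed radius.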
1.5 Concepts of Ball and Hyperplane

Partitioning is, broadly speaking, the basic principle of all storage structures: the search space is partitioned into subsets so that, when a query is answered, only some of these subsets need to be searched. In a metric space, the data set has no coordinates that can be used for geometric divisions. To solve this issue, partitioning in metric space is based on selecting an object and promoting it to a pivot. All the other objects are classified by computing their distance from this pivot. A chosen distance value acts as a threshold and partitions the objects into two subsets. Researchers often exploit two key concepts: the ball partition and the hyperplane partition.

1.5.1 Ball partitioning

A ball is a general concept that generalizes the disc in the Euclidean plane and the sphere in space. It is a subset of a metric space defined by a center object, or "pivot" $p$, and a radius $r$ (Figure 1.2). The use of a pivot $p \in O$ and a radius $r \in \mathbb{R}^+$ allows the division of the objects into two subsets $S_1$ and $S_2$. The formal definition of ball partitioning is as follows:

Definition 1.5.1 (Ball). Let $M = (O, d)$ be a metric space. Let $p \in O$ be a pivot object and $r \in \mathbb{R}^+$ the covering radius. Then $Ball(O, d, p, r)$, written $Ball(p, r)$ where there is no ambiguity about the metric space, defines a ball that partitions the space into two subsets $S_1$ and $S_2$:

$$S_1 = \{o \in O,\ d(p, o) \leq r\}$$
$$S_2 = \{o \in O,\ d(p, o) \geq r\} \qquad (1.9)$$

The redundant conditions $\leq$ and $\geq$ provide balance when the radius $r$ is chosen as the median distance and this median value is not unique. This is achieved by assigning each element at the median distance to one of the subsets in an arbitrary but balanced way.

Figure 1.2: Ball partitioning scheme [18].
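Ball partitioning with a median radius can be sketched in a few lines. This is an illustrative sketch, not the thesis implementation: the function name `ball_partition`, the rule of sending each tie to the currently smaller subset, and the one-dimensional sample data are all assumptions made for the example.

```python
from statistics import median


def ball_partition(objects, d, pivot, r):
    """Split objects into (inside, outside) of the ball centred on pivot.

    Objects exactly at distance r may go to either side; here each one is
    sent to the currently smaller subset to keep the split balanced
    (an assumed tie-breaking rule).
    """
    inside, outside = [], []
    for o in objects:
        dist = d(pivot, o)
        if dist < r:
            inside.append(o)
        elif dist > r:
            outside.append(o)
        elif len(inside) <= len(outside):
            inside.append(o)
        else:
            outside.append(o)
    return inside, outside


def abs_diff(a, b):
    return abs(a - b)


objects = [1, 2, 3, 4, 5, 6]
pivot = 0
# Choosing r as the median distance to the pivot gives two equal halves.
r = median(abs_diff(pivot, o) for o in objects)
inside, outside = ball_partition(objects, abs_diff, pivot, r)
```

Here the median distance is 3.5, so the objects 1 to 3 fall inside the ball and 4 to 6 outside it.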


1.5.2 Hyperplane partitioning 


This partitioning concept also splits the set $O$ into two subsets, using two pivots $p_1$ and $p_2$ that are chosen arbitrarily (Figure 1.3). Each remaining object $o$ is assigned to $S_1$ or $S_2$ according to its distances to the selected pivots $p_1$ and $p_2$, as follows:

Definition 1.5.2 (Generalized hyperplane partitioning). Let $M = (O, d)$ be a metric space. Let $(p_1, p_2) \in O^2$ be two pivot objects with $d(p_1, p_2) > 0$. Then $H(O, d, p_1, p_2)$, written $H(p_1, p_2)$ where there is no ambiguity about the metric space, defines a hyperplane that partitions the space into two subsets $S_1$ and $S_2$:

$$S_1 = \{o \in O,\ d(p_1, o) \leq d(p_2, o)\}$$
$$S_2 = \{o \in O,\ d(p_1, o) \geq d(p_2, o)\} \qquad (1.10)$$

The generalized hyperplane eliminates overlap between the data. Unlike ball partitioning, however, generalized hyperplane partitioning cannot guarantee a balanced distribution, and selecting pivots that achieve such a balance is an interesting open problem.

Figure 1.3: Hyperplane partitioning scheme [18].
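Generalized hyperplane partitioning assigns each object to the closer of two pivots. The sketch below is illustrative only: the function name `hyperplane_partition`, the choice of sending ties to the first subset, and the sample points are assumptions, and `math.dist` (Python 3.8+) stands in for any metric.

```python
import math


def hyperplane_partition(objects, d, p1, p2):
    """Split objects into (s1, s2) by their closer pivot.

    Objects equidistant from both pivots may go to either side; this
    sketch sends them to s1 (an assumed convention).
    """
    s1, s2 = [], []
    for o in objects:
        if d(p1, o) <= d(p2, o):
            s1.append(o)
        else:
            s2.append(o)
    return s1, s2


points = [(0, 0), (1, 1), (4, 4), (5, 5), (2, 2)]
p1, p2 = (0, 0), (5, 5)
s1, s2 = hyperplane_partition(points, math.dist, p1, p2)
```

With these pivots the split is three objects against two, which illustrates that nothing in the rule itself forces the two subsets to have equal size.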


1.6 Similarity Query Search 


The intensive use of IoT devices has led to the emergence of unstructured IoT data of many types, such as images, videos and time series. These types of data cannot be organized in a classical way or searched meaningfully using exact database queries, which retrieve only exact matches. Similarity search is the more common approach, and it still allows index structures to be built and exploited in different spaces. Similarity search is a form of information retrieval in which the query is an example object and the desired result is a set of objects considered similar, in some sense, to the query [19]. A similarity query is given by an explicit or implicit definition of a query object $q$ and by a constraint on the form and extent of the query neighborhood. The response to a query contains all objects that satisfy the constraint, i.e. objects that are close to the given query object.
1.6.1 Range query method

A range query $R(q, r)$ finds all objects $o$ in a set $X \subseteq O$ whose distance from the query object $q$ is at most $r$ (Figure 1.4):

$$R(q, r) = \{o \in X,\ d(o, q) \leq r\} \qquad (1.11)$$

In a range query, the query object $q$ does not necessarily have to exist in the searched set $X \subseteq O$ [18].

Figure 1.4: Range query description.
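Without an index, a range query reduces to a linear scan that computes one distance per object. The following minimal sketch shows this baseline; the function name `range_query` and the sample data are illustrative assumptions, and an index exists precisely to avoid computing most of these distances.

```python
import math


def range_query(X, d, q, r):
    """Return every object of X within distance r of q (linear scan)."""
    return [o for o in X if d(o, q) <= r]


X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.5, 0.0)  # the query object does not have to belong to X
result = range_query(X, math.dist, q, 1.0)
```

Only the first two points lie within distance 1 of the query, even though the query itself is not a member of the searched set.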


1.6.2 Similarity join method 


The similarity join is performed on two sets $X \subseteq O$ and $Y \subseteq O$. It arose from the need to use unstructured data to provide structured services [15]. The similarity join of two datasets $X$ and $Y$ retrieves all object pairs $(x, y) \in X \times Y$ whose distance is not greater than a threshold value $\mu \geq 0$. The similarity join method is formally defined by the following equation:

$$J(X, Y, \mu) = \{(x, y) \in X \times Y,\ d(x, y) \leq \mu\} \qquad (1.12)$$

The dataset $X$ may be identical to the dataset $Y$; this case is called the similarity self join (Figure 1.5). If $\mu = 0$, we get the traditional natural join.

Figure 1.5: Similarity self join query with $\mu = 2.5$ [15].
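The similarity join can be written as a nested loop over both datasets. This is a naive sketch for illustration, assuming one-dimensional data and the absolute difference as the metric; the function name `similarity_join` and the sample values are not taken from the thesis.

```python
def similarity_join(X, Y, d, mu):
    """Return all pairs (x, y) with d(x, y) <= mu (nested-loop join)."""
    return [(x, y) for x in X for y in Y if d(x, y) <= mu]


def abs_diff(a, b):
    return abs(a - b)


X = [0.0, 1.0, 5.0]
Y = [0.5, 4.0]
pairs = similarity_join(X, Y, abs_diff, 1.0)

# With mu = 0 only identical values match, as in a natural join.
exact = similarity_join([1, 2], [2, 3], abs_diff, 0)
```

The nested loop computes one distance per pair, so its cost is the product of the two dataset sizes; this is the baseline an index is meant to improve on.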


1.6.3 Reverse nearest neighbor query method

The reverse nearest neighbor query inverts the nearest neighbor query: it finds the objects that see the query object $q$ as their nearest neighbor [15]. In its general form, this query search method finds every object that has $q$ among its $k$ nearest neighbors (Figure 1.6). In this figure, dotted circles represent the distances from the objects $o_i$ to their second closest neighbor. The objects $o_4$ and $o_5$ satisfy the query $2RNN(q)$, i.e. they have $q$ among their two nearest neighbors, and are represented by blue dots [15]. The response set of the general query $kRNN(q)$ is given by the following relation [15]:

$$kRNN(q) = \{S \subseteq X,\ \forall x \in S : q \in kNN(x) \wedge \forall x \in X - S : q \notin kNN(x)\} \qquad (1.13)$$

Figure 1.6: Reverse nearest neighbor query with k=2 [15].
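A brute-force version of the reverse k nearest neighbor query makes the definition concrete. The sketch below rests on stated assumptions: the function name `reverse_knn` is illustrative, the neighbors of each object x are taken among the other objects of X together with q, and ties are resolved in favour of q.

```python
def reverse_knn(X, d, q, k):
    """Return the objects of X that have q among their k nearest neighbours.

    An object x qualifies when fewer than k other objects of X are strictly
    closer to x than q is (assumed tie rule: ties favour q).
    """
    result = []
    for i, x in enumerate(X):
        dq = d(x, q)
        closer = sum(
            1 for j, o in enumerate(X) if j != i and d(x, o) < dq
        )
        if closer < k:
            result.append(x)
    return result


def abs_diff(a, b):
    return abs(a - b)


X = [0.0, 1.0, 2.0, 10.0]
q = 3.0
rnn1 = reverse_knn(X, abs_diff, q, 1)
rnn3 = reverse_knn(X, abs_diff, q, 3)
```

For k = 1 the answer contains 2.0 and also 10.0: the distant object 10.0 has q as its nearest neighbor although q's own nearest neighbor is 2.0, which shows that the relation is not symmetric and that the answer size is not tied to k.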


1.6.4 Nearest Neighbor query method 


The basic form of this query search method finds the closest object to the given query object $q$, i.e. the nearest neighbor of $q$ [18]. In the general case, we search for the $k$ nearest neighbors (kNN): a kNN query finds the $k$ nearest neighbors of the object $q$. Figure 1.7 illustrates the situation for $k = 5$, in which the five objects closest to the query $q$ are returned. Formally, the response set is defined as follows [15]:

$$kNN(q) = \{S \subseteq X,\ |S| = k \wedge \forall x \in S,\ y \in X - S : d(q, x) \leq d(q, y)\} \qquad (1.14)$$

Figure 1.7: kNN query search with k=5.
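As a baseline, a kNN query can be answered by scanning the whole set and keeping the k smallest distances. This minimal sketch uses the standard library function `heapq.nsmallest`; the function name `knn` and the sample data are illustrative assumptions, and the sketch is sequential, not the parallel search proposed in the thesis.

```python
import heapq


def knn(X, d, q, k):
    """Return the k objects of X closest to q, nearest first (linear scan)."""
    return heapq.nsmallest(k, X, key=lambda o: d(q, o))


def abs_diff(a, b):
    return abs(a - b)


X = [0.0, 1.0, 2.0, 10.0, 4.5]
q = 3.0
neighbours = knn(X, abs_diff, q, 2)
```

The distances to the query are 3, 2, 1, 7 and 1.5, so the two nearest neighbors are 2.0 and 4.5. When several objects are at the same distance as the k-th neighbor, any of them gives a valid answer under the definition.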


1.7 Conclusion 


Processing data that is heterogeneous in type and dimension in metric space, as an alternative to multidimensional space, is therefore of great interest. Indeed, in metric space, distance values make it possible to distinguish high-dimensional objects, in particular when the data does not follow a uniform distribution, which is the case for real data such as Internet of Things (IoT) data. Using ball or hyperplane data partitioning in metric space makes it easy to separate the relevant objects from the discarded ones when running a similarity query search method such as kNN.
2 Internet of Things (IoT)

2.1 Introduction

Nowadays, devices such as smartphones, smart-home and healthcare devices, and home appliances all generate data. The network formed by these interconnected devices, which generate massive data, is known as the Internet of Things (IoT). It is a dynamic network infrastructure in which physical and virtual "objects" have identities, physical attributes, virtual personalities and intelligent interfaces. In the past few years, numerous studies have been carried out on IoT [20], [21]. However, very few publications discuss and point out the challenges of IoT [22]. The aim of this work is to address one of these challenges, namely the storage and retrieval of information.

In this chapter, we present the definition and applications of IoT before presenting IoT challenges. Cloud computing and fog computing, proposed as solutions to IoT challenges, are described at the end of this chapter.


2.2 Ubiquitous Computing in the Future Decade 


The development of technology has driven the transition from personal computers to smartphones and other portable devices, and the interaction between them has changed our daily lives. This has induced a fundamental transformation in computing called ubiquitous computing (UbiComp) [23]. Several approaches to ubiquitous computing have appeared in the literature [24], [25], [26]. The first is the Calm Computing approach proposed by Mark Weiser, the father of ubiquitous computing [24]. He defined the intelligent environment as "the physical world richly and invisibly interwoven with sensors, actuators, displays and computational elements, seamlessly integrated into the everyday objects of our lives and connected via a continuous network". Later, Rogers [25] proposed a human-centric UbiComp based on human creativity in using the environment to enhance people's lives. This approach provides a solution for a specific UbiComp domain. In [26], Caceres and Friday discussed the elements that make up UbiComp and the characteristics a system needs to address the changing world. They point out two critical technologies for the growth of UbiComp: cloud computing infrastructure and the Internet of Things (IoT).


2.3 IoT Definition


The current internet has evolved into a network of interconnected objects that
do not merely sense information and interact with the physical world. They also
provide services for information transfer, analysis and communication between
them. The term "Internet of Things" for this technological development was first
coined by Kevin Ashton in 1999 in the context of supply chain management [27].
In the past decade, however, the term has become more inclusive, covering a wide
range of applications such as health care, utilities and transportation [28].


IoT is the beginning of a new era of computing technology. It relies on a global
network in which different smart things communicate among themselves, with
machines, and with environments. The definition of IoT varies in the literature
from one author to another.


According to Van Kranenburg et al. [29], IoT is defined as a dynamic global
network infrastructure with self-configuring capabilities based on standard and
interoperable communication protocols, where physical and virtual "things" have
identities, physical attributes and virtual personalities, use intelligent
interfaces, and are seamlessly integrated into the information network. According
to Atzori et al. [30], the Internet of Things is based on three paradigms:
internet-oriented (middleware), things-oriented (sensors) and semantic-oriented
(knowledge). Although this distinction is necessary because of the
cross-disciplinary nature of the topic, the potential utility of the IoT can only
be realized in an application domain where the three paradigms intersect.
According to the Cluster of European Research Projects on IoT [28], "things" are
active participants in business, information and social processes where they are
enabled to interact and communicate among themselves and with the environment by
exchanging data and information sensed about the environment, while reacting
autonomously to real/physical world events and influencing them by running
processes that trigger actions and create services with or without direct human
intervention. The Radio Frequency IDentification (RFID) group describes IoT as
the worldwide network of interconnected objects uniquely addressable based on
standard communication protocols [23]. According to Gubbi et al. [23], IoT is an
interconnection of sensing and actuating devices providing the ability to share
information across platforms through a unified framework, developing a common
operating picture for enabling innovative applications. This is achieved by
seamless ubiquitous sensing, data analytics and information representation, with
cloud computing as the unifying framework. The definition provided by the ITU
(International Telecommunication Union) is that IoT is a global infrastructure
for the information society, enabling advanced services by interconnecting
(physical and virtual) things based on existing and evolving interoperable
information and communication technologies [20]. According to Sharma et al. [21],
the term Internet of Things (IoT) is a general concept for the capacity of
networked devices to sense and capture data from all over the world and then
distribute that data across the global internet, where they may be analyzed and
used for different useful applications. The IoT consists of smart machines
communicating and interacting with other machines, objects, environments and
infrastructures. According to Hukeri et al. [31], IoT is the growing network of
objects or "things" integrated using electronics, sensors, software and
connections to achieve higher value and service by exchanging data with the
manufacturer, the operator or other interconnected devices. Each thing is
uniquely identifiable through its embedded computer system, but it is able to
interoperate within the current internet infrastructure.


From all these definitions, we can conclude that IoT is a dynamic global network
infrastructure, where physical and virtual "things" have identities, physical
attributes and virtual personalities, and use intelligent interfaces. These
things are able to interact and communicate among themselves and with the
environment by exchanging data and information.


2.4 IoT Functional Blocks 


An IoT system is composed of a number of functional blocks that provide the
system with its various capabilities. These blocks are device, communication,
service, management, security and application [32].


• The devices block provides monitoring, detecting, actuating, and surveillance
  activities. Devices exchange data with other connected devices and
  applications, or collect data from other devices. They may process data
  locally, send it to centralized servers or cloud-based application back-ends
  for processing, or perform some tasks locally and other tasks within the IoT
  infrastructure, depending on temporal and spatial constraints. IoT devices
  may also be of various types, for example, wearable sensors, smart watches,
  LED lights, automobiles, and industrial machines [32].

• The communication block ensures communication across devices and with
  distributed servers. IoT communication protocols generally function in the
  data link layer, network layer, transport layer and application layer [32].

• The services block in an IoT system supports different types of functions,
  including device control, data management, device modeling, data delivery
  and device recovery services [32].

• The management block provides the functions needed to govern an IoT system
  [32].

• The security block protects the IoT system by providing functions such as
  authentication, authorization, confidentiality, message integrity and data
  security [32].

• The application block is the most critical from the users' point of view, as
  it works as an interface that delivers the necessary modules to control and
  supervise various aspects of the IoT system [32].


2.5 IoT Architecture 


The continuous evolution of IoT, driven by the association of devices with other
areas such as cloud computing, has allowed the improvement of sensors and
actuators and the creation of smaller devices with a network connection [33]. In
this context, the number of connected heterogeneous devices is growing into the
billions or trillions. To manage the data collected from these devices, a
flexible layered architecture seems to be the better approach. The basic model is
the three-layer architecture [34], [35], [36]. It consists of the perception
layer, the network layer and the application layer (Figure 2.1).

Figure 2.1: IoT architecture with three layers [38].


The perception layer is the lowest layer of the IoT architecture. As its name
indicates, its objective is to collect data from the environment. All data
collection and sensing is done in this layer [37].

The network layer receives the data gathered by the perception layer. It
collects the data from the lower layer and sends it to the Internet. The network
layer may consist of only a gateway, with one interface connected to the sensor
network and one connected to the Internet.

The application layer obtains information from the network layer and manages
applications globally according to the information processed by the network
layer. Depending on the type of devices in the perception layer, their purpose,
and the way their data have been processed by the network layer, the application
layer presents the data in the form of smart city, smart home, smart
transportation, vehicle tracking, smart agriculture, smart health and many other
types of applications [34].
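The flow of data through the three layers described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the `Reading` type, the three function names and the hard-coded sensor values are hypothetical and do not come from the cited architectures.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Reading:
    """A single measurement produced by a (hypothetical) sensor."""
    sensor_id: str
    kind: str
    value: float


def perception_layer():
    """Lowest layer: collect raw readings from the environment (hard-coded here)."""
    return [
        Reading("t1", "temperature", 21.5),
        Reading("t2", "temperature", 22.5),
        Reading("h1", "humidity", 40.0),
    ]


def network_layer(readings):
    """Gateway: forward readings from the sensor network as plain messages."""
    return [{"sensor": r.sensor_id, "kind": r.kind, "value": r.value} for r in readings]


def application_layer(messages):
    """Top layer: present the data to the application as an average per kind."""
    by_kind = {}
    for message in messages:
        by_kind.setdefault(message["kind"], []).append(message["value"])
    return {kind: mean(values) for kind, values in by_kind.items()}


summary = application_layer(network_layer(perception_layer()))
print(summary)
```

Each layer only consumes the output of the layer directly below it, which is the property that makes the layered model easy to extend with additional layers.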
In the literature, other models have been proposed that add more abstraction
to the IoT architecture [34], [35], [30], extending the three-layer architecture
to a five-layer architecture. They add two more layers, the middleware layer and
the business layer (Figure 2.2).

Figure 2.2: IoT architecture with five layers.


The middleware layer links a service to its requester according to addresses and
names. It also connects to the database to store the received data, delivers the
required services over the network wire protocols, and makes decisions based on
the data coming from the network layer [34], [36].

The business layer supports the full range of operations and services of the IoT
system. It supports decision-making based on big data analysis [33]. The
supervision and management of the four underlying layers is also done at this
layer. Here, the results of each layer are compared with the results of the
other layers to improve services and preserve user privacy [34], [36].
2.6 IoT Applications

IoT plays a major role in enhancing the quality of our lives in several
application areas, including transportation, healthcare and industrial
automation. These applications are classified by Gubbi et al. [23] into four
areas: personal and home, enterprise, public services and mobile (Figure 2.3).

Figure 2.3: IoT applications domain [23].


2.6.1 Personal and Home 


The information captured by the sensors is used only by the individuals who own
the network [23]. WiFi is generally used as the network backbone, allowing
higher-bandwidth data transfer (video) and higher sampling rates (sound).

Among home applications, we mention:

• Control of domestic equipment, including [21]:

  – Energy and water consumption: monitoring energy and water consumption to
    obtain advice on how to reduce cost and save resources.

  – Remote-controlled appliances: turning appliances on and off remotely to
    prevent accidents and save energy.

  – Intrusion detection systems: detection of window and door openings and
    violations to prevent intruders.

  – Preservation of art and property: surveillance of conditions inside
    museums and warehouses.

• Healthcare:

  – Integrating sensors and actuators into patients and their medications
    for monitoring and follow-up applications in hospitals.

  – Home monitoring systems for aged care, which allow the doctor to
    monitor patients in their homes.


2.6.2 Enterprise 


In a working environment such as an enterprise, the data collected from the
networks are used only by the owners, and the data may be released selectively
[23]. These networks form intelligent environments, among which we can cite smart
home, smart city, smart agriculture, smart water and smart transportation [39].


2.6.3 Public Service

The information from IoT networks is generally used to improve services. In
smart homes, energy consumption is improved by continuously monitoring every
power point inside the home and using that information to improve the manner in
which electricity is consumed. In the video-based Internet of Things, video
monitoring is improved: network camera surveillance applications help monitor
targets and identify suspicious activities. In smart agriculture, agricultural
production is improved by controlling the watering of agricultural land.


2.6.4 Mobile 


There are two different domains of mobile applications: intelligent
transportation and intelligent logistics. They are placed in a distinct domain
because of the nature of the data exchange and of the backbone needed [23].
Intelligent transportation [23] will allow large-scale Wireless Sensor Networks
(WSNs) to be applied for online tracking of travel times, source-destination
routing information, queue duration, and pollutant and noise generation.
Intelligent logistics [40] includes the tracking of transported items as well as
the efficient planning of transport. The tracking of transported items is carried
out more locally, whereas the planning of transport is done by means of a
large-scale IoT network.


2.7 IoT Challenges 


2.7.1 Security and privacy

Security is a main challenge since the IoT covers very large-scale networks,
which can face several types of attacks. The three physical components of the
IoT (RFID, WSN and cloud) are vulnerable to these attacks. Security is essential
for any network [41], [42]. According to Juels et al. [43], the most vulnerable
component is Radio Frequency IDentification (RFID), as it allows people and
objects to be tracked and no high-level intelligence can be enabled on these
devices.

2.7.2 Availability


The availability of the IoT needs to be realized at the hardware and software
levels to provide services to clients at any time and in any place. Software
availability refers to the capability of IoT applications to deliver services to
anyone at different locations simultaneously, while hardware availability refers
to the continuous existence of devices that support IoT functionality and
protocols [33]. A variety of devices, together with a variety of communication
protocols running over TCP/IP or advanced software stacks, could certainly manage
the web services that will be offered by different middleware solutions [44].


2.7.3 Reliability 


Reliability aims to improve the service delivery of IoT. However, IoT deployment
is very complicated and involves heterogeneous networks and smart devices, which
makes reliability a challenge. Reliability needs to be implemented in software
and hardware in each IoT layer. For an efficient IoT, the underlying
communication should be robust: unreliable perception, data collection,
processing and transmission can lead to long delays, loss of data and possibly
bad decisions, which can lead to disastrous scenarios and therefore make the IoT
seem less reliable [45].


2.7.4 Mobility 


Delivering IoT services to mobile users raises major challenges. These include
ensuring service continuity while users are on the move and handling service
interruptions for mobile devices. The huge number of smart devices in IoT
systems also requires effective mechanisms for mobility management.


CHAPTER. 2 Internet of Things (IoT) 


2.7.5 Performance 


The IoT comprises an enormous number of components which provide services.
The performance of IoT services therefore depends on the performance of their
components. These components require continuous supervision to ensure that
client demands are satisfied. Several measurements can be used to evaluate the
performance of the IoT, such as processing speed, communication speed, device
form factor and cost [33].


2.7.6 Management 


Managing IoT resources, including configuration, accounting, performance and
security, is a challenge because trillions of smart devices are connected. With
the growing number of these resources, the development of new lightweight
management protocols, able to handle the management problems that will arise
from the deployment of IoT in the coming years, becomes much more important.


2.7.7 Scalability 


The addition of new functions and services for new equipment is a complex process
within the IoT, as various hardware platforms and communication protocols are
available. In addition, this scalability must be achieved without affecting the
quality of the current services.


2.7.8 Interoperability 


The existence of heterogeneous and very complex network platforms, together with
the diversity of device types and of their communication technologies, makes
interoperability a challenge. Interoperability must be addressed by application
developers and IoT device producers to guarantee the service to all clients,
whatever hardware platform they are using. Interoperability must also be
addressed in the design and construction of IoT services to satisfy clients'
needs [46]. Furthermore, interoperability must be addressed in the communication
protocols.


2.7.9 Huge heterogeneous data 


The wide range of devices connected to the IoT generates data of various types,
sizes and formats. The variety and huge volume of this heterogeneous data
create a serious challenge in the IoT.


2.8 Cloud Computing 


2.8.1 Cloud computing definition 


Several industry giants, standardization organisations and researchers have tried
to define cloud computing according to their own understanding and opinions.
Cloud computing is defined by the U.S. National Institute of Standards and
Technology (NIST) as a model for enabling ubiquitous, convenient, on-demand
network access to a shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort or service provider
interaction [47]. This cloud model is composed of five essential characteristics
(on-demand self-service, broad network access, resource pooling, rapid elasticity
and measured service), three service models (software as a service, platform as
a service, and infrastructure as a service) and four deployment models (public,
private, community, and hybrid) (Figure 2.4).

Figure 2.4: Scheme of NIST Cloud computing definition.

Cloud computing characteristics [47]

1. On-demand self-service: A consumer can unilaterally provision computing
   capabilities, such as server time and network storage, as needed
   automatically without requiring human interaction with each service provider.

2. Broad network access: These computing capabilities are made available over
   the network (e.g. the Internet) and used by different client applications
   running on heterogeneous platforms (such as mobile phones, laptops and PDAs)
   located at a consumer site.

3. Resource pooling: A cloud service provider's computing resources are "pooled"
   to serve multiple consumers using either the multi-tenancy or the
   virtualization model, "with several physical and virtual resources
   dynamically allocated and reallocated based on consumer request".

4. Rapid elasticity: Capacity may be provisioned and released elastically, in
   some cases automatically, to scale quickly outward and inward as demand
   dictates. To the consumer, the capacity available for provisioning often
   seems unlimited and can be appropriated in any quantity at any time.

5. Measured service: Cloud systems automatically control and optimize resource
   allocation by leveraging a metering capability at some level of abstraction
   that is appropriate for the type of service (e.g., storage, processing,
   bandwidth and active user accounts). Resource usage may be monitored,
   controlled and reported, ensuring transparency for both the provider and the
   consumer of the service used.

Figure 2.5: Cloud computing service models [48].


Cloud computing service models [47] 


1. Software as a Service (SaaS): The ability for cloud consumers to use their
   applications in a hosting environment, which can be accessed via networks
   from various clients (e.g., browser, PDA, etc.) (Figure 2.5). Cloud consumers
   do not control the cloud infrastructure, which often uses a multi-tenant
   system architecture, i.e. the applications of different cloud consumers are
   organized in a single logical environment on the SaaS cloud in order to
   benefit from economies of scale and optimization in terms of speed, security,
   availability, disaster recovery and maintenance. The user does not need to
   store, install or maintain the application. However, Internet connectivity is
   needed to access the service, which is rented from the SaaS provider on the
   cloud. Examples of these services include SalesForce.com, Google Mail and
   Google Docs.

2. Platform as a Service (PaaS): The capability offered to the consumer is a
   platform for developing and deploying cloud services and applications
   (Figure 2.5). The consumer does not manage or control the underlying cloud
   infrastructure, including network, servers, operating systems or storage, but
   has control over the deployed applications and possibly the configuration
   settings of the application-hosting environment.

3. Infrastructure as a Service (IaaS): The capability offered to the consumer is
   to provision processing, storage, networks and other computing resources on
   which the consumer can deploy and run arbitrary software, including operating
   systems and applications (Figure 2.5).


A fourth service model, called Network as a Service (NaaS), was added by Aazam
et al. [49]. NaaS represents the ability to provide one or more virtual networks
to users. The user can have as many networks as needed, with the appropriate
segmentation and policy enforcement. With NaaS, the user may also have
heterogeneous networks.

Cloud computing deployment models [47]


1. Private cloud: The cloud infrastructure is operated solely for a single
   organization with multiple consumers. It can be managed by the organization
   or by a third party, and may be located on or off premises. There are several
   reasons for introducing a private cloud in an organization: security
   concerns, including data privacy and confidentiality; the cost of moving data
   from one infrastructure to another; the wish to optimise the use of existing
   internal resources; and the fact that organizations always demand complete
   control over critical activities that reside behind the cloud.

2. Community cloud: The cloud infrastructure is constructed by multiple
   organizations, which share the same cloud infrastructure as well as policies,
   requirements, values and concerns. The cloud infrastructure may be hosted
   either by a third-party provider or within one of the organizations of the
   community.

3. Public cloud: The cloud infrastructure is provided for public use. The public
   cloud is used by the general public, and the cloud service provider has
   complete ownership of the public cloud along with its own policy, costing,
   profit and billing model. Many popular cloud services are public clouds, such
   as Amazon EC2, S3 and Google AppEngine.

4. Hybrid cloud: The cloud infrastructure is composed of two or more different
   cloud infrastructures. Organizations employ the hybrid cloud model to
   optimize their resources in order to enhance their core competency and to
   control their core business on-premises via the cloud. The hybrid cloud has
   raised the issues of standardization and interoperability of the cloud.

2.8.2 Cloud computing architecture


The cloud architecture is composed of three layers: infrastructure, platform and
application [50] (Figure 2.6).

Figure 2.6: Cloud computing architecture [50].


• The infrastructure layer is the most basic layer. This layer delivers the
  processing, storage, networking and other computing resources. Cloud service
  customers can deploy and run operating systems and software on this
  infrastructure.

• The platform layer delivers higher-level abstractions and services for
  applications in a single integrated development environment. This layer
  includes an execution environment and middleware to support the deployment of
  applications using the programming languages and tools of the cloud service.

• The application layer is the top layer. The application layer can sense
  environment data and simultaneously send requests to the cloud to process
  them and obtain the resulting sensor information [49]. It may also need to
  send information back to the IoT, namely data obtained from the sensor layer
  and the results of data analysis, for additional processing [51], [52].


2.8.3 Cloud computing challenges 


The integration of the Internet of Things (IoT) and cloud computing makes
possible the storage, processing and analysis of the massive IoT data generated
by the different devices. However, there are challenges that need to be addressed
to allow the cloud to prevail for the good of the world in general and humanity
in particular. These challenges are presented in what follows.


Security. The variety of applications and the heterogeneity of devices in an IoT
environment make it difficult to ensure the privacy and security of the data
generated by these devices. To address security challenges in cloud computing,
the following considerations are important [53]:

• End-user trust and privacy.
• Source authentication between nodes.
• Secure communications between sensors, compute nodes and brokerage nodes.
• Identification and protection of systems from malicious attacks.
• Robust data management and tamper-resistant databases.

Current research addresses issues such as detection of and recovery from
malicious activity, identification of and protection against attacks, prevention
of malicious threats, protection of user information against theft and dynamic
mutual authentication [54], [55]. Existing research takes a limited view, and
the computational capabilities of both edge and remote resources have not been
fully exploited [56].


Protocol support. In order to get different things connected to the Internet,
various protocols are going to be used. Consequently, even similar entities may
operate on different protocols. The solution to this problem can be the
standardization of protocols.


Energy. The expansion of data collection and processing has resulted in a growth
of energy consumption in cloud data centers of 20 to 25% each year [57]. To
solve this problem, attention has turned to the distributed cloud, which has
increased the importance of fog and edge computing platforms. With the massively
expanding number of IoT devices [58], the communication of all devices with the
cloud results in higher power consumption. Furthermore, smaller IoT devices with
low computing power, storage and battery capacity are being developed [59]. For
example, the batteries that power cameras have to be changed from time to time.
Likewise, encoding video is more complex than decoding it, because for efficient
video compression the encoder has to analyze the redundancy in the video [60].
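To give a sense of what an annual growth of 20 to 25% implies, the following minimal sketch applies ordinary compound growth to the rates quoted above. The function names and the baseline value are illustrative assumptions; only the two growth rates come from the text.

```python
import math


def projected_energy(initial, annual_rate, years):
    """Energy after `years` of compound growth at `annual_rate` (e.g. 0.20)."""
    return initial * (1.0 + annual_rate) ** years


def doubling_time(annual_rate):
    """Number of years needed for consumption to double at a constant rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


for rate in (0.20, 0.25):
    print(f"{rate:.0%} growth: doubles in about {doubling_time(rate):.1f} years")
```

Under these assumptions, consumption doubles roughly every 3.8 years at 20% and every 3.1 years at 25%, which illustrates why moving part of the processing closer to the devices is attractive.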


Reliability. IoT devices rely on cloud providers to operate time-critical
applications, so any failure on the cloud side is directly reflected in the
program's output [61].


Resource allocation. Resource allocation in distributed systems is a difficult
challenge at the scale of current data centers. The varying character of network
devices, device components and communication technologies in large-scale
distributed systems increases the complexity of resource management techniques
[62]. On the other hand, the variety of devices leads to the production of
different types of data, in addition to the sheer amount produced, so it is
difficult to predict the resources they will need in the cloud. Several
platforms have been developed to solve the problem of resource allocation. For
example, Mesos determines the amount of resources to offer to each framework
according to the constraints, while the frameworks in turn decide which offers
to accept. There is therefore a need for new approaches to resource allocation
which help to ensure the stability and efficiency of these systems. Resource
allocation is a critical concept in distributed systems; however, it must ensure
that these systems have high performance, latency sensitivity, reliability and
energy efficiency [63], [64].
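The offer-based scheme mentioned above can be illustrated with a toy sketch. It is not the Mesos API: the `Offer` and `Framework` classes, the `allocate` function and the simplification that one offer hosts exactly one task are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Resources available on one node (hypothetical units)."""
    node: str
    cpus: int
    mem_gb: int


@dataclass
class Framework:
    """A consumer of offers that decides by itself whether to accept them."""
    name: str
    cpus_per_task: int
    mem_per_task: int
    tasks_wanted: int
    accepted: list = field(default_factory=list)

    def consider(self, offer):
        """Accept the offer if a pending task fits on it, otherwise decline."""
        fits = offer.cpus >= self.cpus_per_task and offer.mem_gb >= self.mem_per_task
        if self.tasks_wanted > 0 and fits:
            self.tasks_wanted -= 1
            self.accepted.append(offer.node)
            return True
        return False


def allocate(offers, frameworks):
    """Allocator side: present each offer to the frameworks in a fixed order."""
    placement = {}
    for offer in offers:
        for framework in frameworks:
            if framework.consider(offer):
                placement[offer.node] = framework.name
                break
    return placement


offers = [Offer("n1", 4, 16), Offer("n2", 2, 4), Offer("n3", 8, 32)]
frameworks = [Framework("analytics", 4, 16, 1), Framework("web", 1, 2, 2)]
placement = allocate(offers, frameworks)
print(placement)
```

The point of the two-level design is that the allocator does not need to understand the requirements of every application; it only decides who sees which offer, and each framework applies its own constraints.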


Quality of service. Quality of service (QoS) is a critical challenge in cloud
computing systems, since it is determined by the system performance at run time
[65]. QoS parameters that may be used to measure system performance include
execution time, cost, scalability, elasticity, latency and reliability [63]. QoS
is increasingly important when several cloud services are involved, because
degrading the QoS in one of them can seriously affect the QoS of the complete
computing system. The following are some of the research challenges cited in
[66] that affect the efficient realization of QoS.


1. The non-availability of cloud resources to execute an application at run
   time, which increases the execution time and reduces the system performance.

2. Designing effective resource management mechanisms that take SLAs (Service
   Level Agreements) into account reduces the rate of SLA violations and helps
   to improve the performance of the computing system.

3. The existence of varying SLA standards for the various cloud providers means
   that a centralized SLA standard is needed to attain the goal of a multi-cloud
   environment.

4. Finding the trade-off between the various QoS needs arising from the vast
   number of IoT applications run on cloud systems, using supervised or
   unsupervised learning techniques based on AI (Artificial Intelligence) or
   predictive models.
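The rate of SLA violations mentioned in the list above can be made concrete with a small sketch. The response-time limit, the sample measurements and the function name are hypothetical; real SLAs usually combine several metrics.

```python
def sla_violation_rate(response_times_ms, sla_limit_ms):
    """Fraction of requests whose response time exceeds the agreed limit."""
    if not response_times_ms:
        return 0.0
    violations = sum(1 for t in response_times_ms if t > sla_limit_ms)
    return violations / len(response_times_ms)


# Hypothetical measurements for eight requests, with an assumed 200 ms limit.
observed = [120, 95, 310, 180, 450, 90, 205, 130]
rate = sla_violation_rate(observed, sla_limit_ms=200)
print(f"SLA violation rate: {rate:.1%}")
```

A resource manager that takes SLAs into account would try to keep this rate below an agreed threshold, for example by adding resources when the rate rises.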


2.9 Fog Computing 


2.9.1 Fog computing definition 


The storing and processing of data from various IoT sensors is a critical
challenge in an IoT system. Traditional cloud-based IoT systems are challenged by
the large scale, heterogeneity and high latency observed in some cloud systems
[67]. Consequently, a novel computing paradigm, namely "fog computing", has been
introduced as a complement to the cloud solution. According to NIST [67], fog
computing is a layered model for enabling ubiquitous access to a shared
continuum of scalable computing resources. The model facilitates the deployment
of distributed, latency-aware applications and services and consists of fog nodes
(physical or virtual), residing between smart end-devices and centralized (cloud)
services. The fog nodes are context aware and support a common data management
and communication system. They can be organized in clusters either vertically
(to support isolation), horizontally (to support federation) or relative to the
fog nodes' latency-distance to the smart end-devices. Tasks that require
real-time processing of data from end devices can be performed by nearby fog
nodes, leading to low transmission latency [68]. In addition, fog computing is a
form of distributed computing which functions as the middle layer between IoT
devices and cloud data centers [69].
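The idea that latency-sensitive tasks are served by nearby fog nodes while heavier tasks go to the cloud can be sketched as a simple placement rule. This is an illustrative assumption rather than a method from the cited works: the `Node` type, the latency figures and the `place_task` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A processing node with an assumed round-trip latency to the end device."""
    name: str
    tier: str  # "fog" or "cloud"
    round_trip_ms: float
    free_cpu: float


def place_task(nodes, deadline_ms, cpu_needed):
    """Pick the lowest-latency node that has enough capacity and meets the deadline."""
    candidates = [
        n for n in nodes
        if n.free_cpu >= cpu_needed and n.round_trip_ms <= deadline_ms
    ]
    return min(candidates, key=lambda n: n.round_trip_ms, default=None)


nodes = [
    Node("fog-a", "fog", 8.0, 0.5),
    Node("fog-b", "fog", 12.0, 2.0),
    Node("cloud", "cloud", 90.0, 64.0),
]

realtime = place_task(nodes, deadline_ms=20, cpu_needed=1.0)
batch = place_task(nodes, deadline_ms=1000, cpu_needed=8.0)
print(realtime.name, batch.name)
```

With these example values, the real-time task is placed on a fog node because the cloud cannot meet its deadline, while the heavy batch task falls back to the cloud because no fog node has enough free capacity.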
Characteristics of fog computing


1. Contextual location awareness and low latency

Fog nodes are located in close proximity to IoT devices, which implies low latency
services and applications. Moreover, analysis of and response to the data
generated by devices are considerably more rapid than with a centralized cloud.

2. Geographical distribution

The services and applications that the fog targets need widely distributed
deployments to ensure QoS for mobile and immobile devices [67]. The fog
network distributes its nodes and sensors geographically across different
environments, as in, for example, a healthcare monitoring system [70].


3. Very large number of nodes

This is a consequence of the wide geographical distribution, as reflected in sensor
networks in general and the Smart Grid in particular [67].

4. Large scale sensor networks

The fog offers distributed computing and storage resources, which are needed by
large scale sensor networks such as those used for environmental monitoring and
smart grid applications.

5. Support for mobility

Fog applications communicate directly with mobile devices, so they must support
mobility techniques such as the LISP protocol, which decouples host identity
from location identity by means of a distributed indexing system [67].


6. Real time interactions

Fog applications necessitate real time interactions, such as real time
transmission for traffic monitoring systems or control of a sensitive process on an
oil platform with fog edge devices or sensors.


7. Widespread wireless access

The wide range of wireless sensors distributed and connected in the network
requires distributed analysis and processing. This is why the fog is very well
suited for wireless IoT access networks.

8. Heterogeneity

Fog nodes are very varied in nature and are deployed in a large
variety of environments, which include a variety of devices with different
network communication capabilities.

9. Interoperability and federation

Fog elements need to operate in an interoperable environment to ensure a
wide range of services such as data streaming and real time processing.
These services must be federated between domains.
10. Support for real time analytics and interplay with the cloud

The fog nodes are located nearer to the source that generates the data. This
location provides low latency and context awareness, whereas the cloud
provides global centralization. Analyzing big IoT data requires the localization
of the fog for real-time stream analysis and the globalization of the cloud
for historical analysis. Additionally, the fog is well adapted
to handle video streaming to small TV support devices, surveillance sensors,
live game applications and other applications that require low latency
services in close proximity [71].

11. Scalability and agility of federated fog-node clusters

Fog computing is adaptive in nature, at the cluster or cluster-of-clusters level,
with support for elastic computing, resource pooling, data load changes and
network state variations, to list some of the adaptive features supported [72].


2.9.2 Fog node 


Fog computing employs, in addition to centralized cloud data centers, a vast
number of smaller capacity resources closer to the edge of the network, termed fog
nodes [72]. According to NIST [67], fog nodes are middleware elements within
the smart terminal and access network. Fog nodes can be physical or virtual and
are coupled to smart devices or access networks. Fog nodes typically deliver some
form of data management and communication service between the edge layer in
which the smart terminals reside and the cloud. Fog nodes, in particular virtual
nodes, also called cloudlets, can be federated to provide a horizontal extension of
capability over distributed geolocations.

Fog node attributes

Several attributes are added to the fog node characteristics
to support the deployment of fog computing capability. They are [72]:


1. Autonomy: fog nodes are able to function autonomously, with decisions made
locally, at the node or cluster level.

2. Heterogeneity: fog nodes are available in a variety of form factors and can be
employed in a range of environments.

3. Hierarchical clustering: fog nodes adopt hierarchical structures,
in which different layers provide diverse subsets of service functions while
working collaboratively as a continuum.

4. Manageability: fog nodes are managed and engineered by complex systems
capable of executing most routine functions automatically.

5. Programmability: fog nodes are programmable at multiple levels,
by multiple parties.


Service models

As in cloud computing, the following types of service models
can be implemented in the fog nodes [72]:

1. Software as a Service (SaaS): Users use the fog provider’s applications,
which run on a cluster of federated fog nodes managed by the
provider. The fog node’s applications can be accessed from intelligent objects
via a client or program interface. The infrastructure underlying the fog node,
such as the network, servers, operating systems, and storage, is invisible to
the user.

2. Platform as a Service (PaaS): The fog service provider supplies programming
languages, libraries, services and tools with which clients can deploy their
applications on the platforms of federated fog nodes.

3. Infrastructure as a Service (IaaS): Users can run arbitrary software, which
can include operating systems and applications, that leverages the infrastructure
of the fog nodes forming a federated cluster. Users do not manage or
control the underlying infrastructure of the fog node cluster; however, they
do have control over operating systems, storage and deployed applications.
Deployment models

Similar to cloud computing, the following deployment
models can be applied to fog nodes [72]:

1. Private fog node

A fog node is operated by a single organization with multiple consumers.
This node can be independently controlled, managed and operated by the
organization, a third party, or a hybrid of the two, and it can be located
either inside or outside of the organization.
2. Community fog node 


A fog node that is provisioned by a consumer community of organizations 
that have common concerns. This node can be independently controlled, 
managed and operated by one or more of the organizations in the community, 
a third party or a combination of them, and it can be located either inside 


or outside of the organization. 
3. Public fog node 


A fog node which is provisioned for open use by the general public. This
node can be independently controlled, managed and operated by a company,
university, or government organization, or some combination of them, and it
can be located either inside or outside of the organization.

4. Hybrid fog node

A fog node which is a combination of two or more distinct fog nodes (private,
community, or public) that remain unique entities but are linked
by a standardized or proprietary technology that makes data and applications
portable.


2.9.3 Fog architecture 


In general, most of the research projects carried out on fog computing have
represented it as a three-layer model [53], [73], [74], [75]. Other teams proposed
models with four layers [76], [77], five layers [78], six layers [49] and seven layers
[79].

In addition, the OpenFog Consortium [80] has developed a detailed N-layer
reference architecture, which is regarded as an improvement of the three-layer
model. Another, three-dimensional architecture (the device dimension, the system
dimension and the functionality dimension) is proposed in [81]. However, we
focus on the three-layer architecture in the following.


In the three-layer model, fog computing extends the cloud service to the edge of 
the network, in which a layer of fog is introduced between the terminals and the 
cloud. Figure 2.7 illustrates the hierarchical architecture of fog computing. The 


hierarchical architecture is comprised of the three following layers: 


1. Terminal layer: It is the layer nearest to the end user and the physical
environment. It comprises a number of widely distributed IoT devices, such
as smart vehicles, sensors, cell phones and smart cards. They are responsible
for sensing data and transmitting it to the higher layer for processing and
storage.

2. Fog layer: It is situated at the edge of the network and widely distributed
between the terminals and the cloud. It is composed of a variety of fog nodes,
which typically consist of gateways, routers, access points, specific fog servers,
switches, etc. This layer is important for interaction and collaboration
with the cloud layer, to which it is connected.

3. Cloud layer: It consists of multiple high-performance servers and storage
devices and delivers different application services. It has strong computational
and storage abilities in order to take care of in-depth computational analysis and
storage of big IoT data.


Figure 2.7: Architecture of fog computing.

2.9.4 Fog computing challenges

1. Security and privacy

The security of fog computing devices is challenging because
they are operated in locations without strict protection and monitoring.
These devices are therefore vulnerable to attacks that could be used to compromise
the fog device system in order to perform malicious tasks such as data hijacking
and eavesdropping. The security solutions proposed for the cloud cannot simply
be applied to fog computing, because fog devices operate at the edge
of networks. The operating environment of fog devices exposes them to many
threats that do not exist in cloud computing. The major security issues
raised for fog computing in [82] are man-in-the-middle attacks,
authentication, distributed denial of service, access control and fault tolerance.
2. Control and management of resources

To ensure QoS, fog computing should provision resources in advance
in order to support service mobility. The
major challenge is the mobility of the end nodes, since metrics
such as bandwidth, storage, computation and latency then change dynamically
[82]. In addition, resource management is a challenge because
fog computing must manage the sharing and discovery of resources
used by cloud applications as well as the sharing of resources used
by the devices in the terminal layer.
3. Programming platform

In fog computing, the computation is performed on edge nodes at the user end,
which usually run heterogeneous platforms that differ from one another,
so programming on such heterogeneous platforms is a major challenge.
4. Energy management

Fog computing systems comprise multiple distributed nodes, so energy consumption
is expected to be higher than in their cloud counterparts. Hence,
much effort is required to develop and optimize new energy efficient protocols
and architectures in the fog paradigm, e.g., efficient communication
protocols and computing and network resource optimization [82].

5. Fog networking

The fog network is heterogeneous, is situated at the edge of the network and
extends the functionality of cloud computing. The fog networking
requirement is to interconnect every required component with the nodes so as to
maintain and ensure the quality of service of core network connectivity and
service delivery across all these components. With the increasing, large scale
use of IoT, this might not be straightforward [70].
6. Quality of Service (QoS)

Quality of service is very important in fog computing and poses a
challenge. Anawar et al. [70] classified the related challenges into four
dimensions, namely reliability, delay, connectivity and capacity, which are
discussed as follows:

• Reliability is most important for data transmission and security in the
core network. In addition, it requires periodic surveillance by
means of checkpoints in order to recover from a fault.

• Delay is a difficult challenge in fog computing, since a fog network
deployed for latency-sensitive applications needs
real time streaming and processing responses.

• Connectivity in the fog networking environment requires providing
partitioning and clustering capabilities for cost minimization, data
reduction and an extension of connectivity methods.

• Capacity is categorized in [70] into two groups,
the first being network bandwidth and the second being
storage capacity. These are very important factors in enabling and
maintaining effective bandwidth and storage operations. In addition,
real-time response, fog node mobility and large fog computational
volume are all factors that must be considered to save maximum
bandwidth with low latency.


2.10 Conclusion 


In this chapter, the definition of IoT, its architecture, its main applications and
its challenges were presented and discussed, in addition to the modern cloud and fog
computing paradigms that have emerged to support the deployment of IoT based
applications.

Given the great interest in IoT, the challenges presented remain to be solved,
especially that of the huge volume of heterogeneous IoT data. Processing this big
data in metric space, like processing it in multidimensional space, suffers from
data overlap. A pre-processing step that diminishes data overlap
when big IoT data is stored would therefore be very useful. Analytics methods
are good candidates for pre-processing big IoT data. Among these
methods are clustering methods, which group objects into homogeneous sets
or clusters.

3.1 Introduction


The development of big data and IoT is accelerating rapidly and affecting all areas
of technology and business. The growth of data produced via the IoT has played
a major role in big data. The widespread use of IoT big data has made it difficult
to analyze. This necessitates the development of IoT specific data analytics
solutions which can handle the heterogeneity, dynamicity and velocity of IoT
data [10].

Data mining is the process of extracting useful information or discovering hidden
relationships among data. This information or knowledge helps business
organizations to grow, as it supports decision making. Data mining
technology has gone through several stages [83].


Big data analytics enables data miners and scientists to analyze huge amounts
of unstructured data that cannot be harnessed using traditional tools [84]. These
analytics tools are developed using data mining algorithms chosen for a specific
scenario, such as prediction methods, association rule methods, classification
methods and clustering methods [85]. Clustering methods are an essential branch of
the data mining family and have been widely applied in IoT applications such as
outlier detection, finding similar sensing patterns, and segmenting large
behavioral groups in real time [86].

In this chapter, we present definitions of big data and big IoT data. After that, we
focus on different clustering methods, since one of them (the Density-Based
Spatial Clustering of Applications with Noise algorithm) is used in this work in
the pre-processing step, which allows parallel processing of big IoT data. At
the end of this chapter, other big IoT data analytics methods are presented
after a comparison between the different clustering methods.


3.2 Big Data Definition 


The term "big data" officially arrived in the computing field in 2005, when it was
used by Roger Magoulas of O’Reilly to depict massive volumes of data that cannot
be managed and processed by traditional data management techniques because they
are too complex and vast in size [87], [88]. The term "big data" had occurred
in previous literature, although it is comparatively new in business and IT
(Information Technology) [89]. Several studies related to big data are available.
One of these studies, "Digital Universe" [90], defines big data technologies as a
new generation of technologies and architectures that aim to exploit a massive
volume of data with different formats by enabling high-speed capture, discovery
and analysis.

Other studies describe big data along three dimensions, the 3 Vs "volume, variety,
velocity" [91], which are regarded as the essential characteristics when defining
big data. This is consistent with Madden’s [92] definition of big data as data that
is too big (volume, coming from different sources), too fast (velocity, as it must be
processed rapidly) and too hard (variety) to process with existing tools.


Another approach to defining big data expands the 3 Vs to 4 Vs by adding the
aspect of "veracity", for data that is too uncertain [93].

A further attribute, "value", is introduced to give greater meaning, forming the
5 Vs of big data. Big data is critical to organizations because it facilitates the
collection, storage, management and manipulation of massive amounts of data in
order to make useful decisions. This justifies the inclusion of "value" as the fifth
attribute of big data: the value of the collected data relates to the intended
process or predictive analysis. The value of data is closely associated with other
attributes of big data such as volume and variety [94].

In the preceding definitions, the attributes of big data stay the same in an
enterprise network, apart from "variability", which takes the network infrastructure
into account, especially regarding the integration and evolution of the data and
the model (variability deals with varying data and the associated models) [95].


The appearance of technologies that enable real-time communication with objects
that move in real time, known as interacting spatial and temporal database models,
calls for a more refined approach to data representation.
This integrates the seventh attribute, "visualization" [96]. This
attribute ensures the readability and accessibility of data presentations that
involve many spatial and temporal parameters and the relationships between them
[97].

Another study introduced "validity" as the eighth V of big data. It means
that the data should be clean, accurate, precise, specific, reliable, valid and
useful for future processing. Every organization should validate its data if it
needs to take the right decisions for the future based on the data collected by the
devices. Validity is therefore regarded as an important factor for big data [98].


The ninth V of big data is "vulnerability". Data violation is a critical
concern in today’s age of technology. Hackers are continually
breaking into systems and databases to gain access to information. A big data
violation is a major breach; hence, vulnerability is also a challenging and critical
characteristic of big data, since securing information against unauthorized persons
and unauthenticated access is a basic need.

Figure 3.1: Brief description of the 10 Vs of big data.


A further study added "volatility" as the tenth V of big data [99].
Volatility concerns the lifetime of the data
and answers the following question: how long will the data be regarded as valid and
how long does it need to be stored? A brief description of the 10 Vs of big
data is shown in Figure 3.1. In [100], the researchers presented big
data intelligence, which refers to the ensemble of concepts, technologies, tools and
systems that are able to approximate human intelligence in the management and
processing of big data.

3.3 Big IoT Data Definition


The term "big IoT data" appeared as a result of several research efforts and
technological developments. Recently, Sun et al. [100] added an important feature
to the 10 Vs. It captures all technologies, systems, platforms and facilities which
support big data processes. In this context, the 10 Vs of big data are capable of
integrating IoT data: in the area of IoT, the continuous increase in the number of
IoT devices has led to the production of huge amounts of data. According to
statistics [101], the number of devices will increase by 1 trillion by 2030. As these
devices are numerous, they have become a source of big data called "big IoT data". A
most notable characteristic of IoT is its analysis of data on "connected objects"
[85]. The analysis of big IoT data needs various methods for processing and
storing a large amount of IoT data.


3.4 Clustering Methods 


Clustering is a data mining technique that is used as a major method of data
analysis. It employs an unsupervised learning approach and generates
groups for given objects based on their distinctive significant features [102].
Clustering is the process of grouping a set of physical or abstract objects into
classes of similar objects. A cluster is a collection of data objects that are
similar to one another within the same cluster and dissimilar to the objects in
other clusters. Clustering algorithms can be categorized into partitioning-based
algorithms, hierarchical-based algorithms, grid-based algorithms, model-based
algorithms and density-based algorithms [103].

3.4.1 Partitioning clustering algorithm


A partitioning algorithm splits the data points into k partitions. Each partition
is considered as a cluster. Partitioning is performed based on some objective
function. One such function minimizes the square error criterion, which is
calculated as:

E = \sum_{i=1}^{k} \sum_{p \in C_i} \| p - m_i \|^2    (3.1)

where p is a point of cluster C_i and m_i is the mean of that cluster. Among these
partitioning methods, we can cite k-means [104] and FCM (Fuzzy C-Means) [105].


k-means [10]

k-means divides a given data set into k different clusters. This is done
by first choosing k random points in the dataset as the initial cluster centroids,
then allocating each data point to the cluster with the nearest centroid and
adjusting the centroids accordingly. The process is repeated, with the output as
the new input, until the centroids converge to stable points. As the final
clustering results are highly dependent on the initial centroids, the whole process
is performed several times with different initial parameters. For a dataset of fixed
size this might not be a problem; however, in the context of IoT data this
characteristic of the algorithm causes significant computational overhead. Because
of the random initialization, not only does convergence take time, but k-means may
also produce clusters of lower quality.
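
The k-means loop described above can be sketched as follows. This is a minimal
pure-Python illustration and not the implementation used in this work: the function
names kmeans and squared_error, the max_iter and seed parameters and the use of the
squared Euclidean distance are assumptions made for the example. The helper
squared_error evaluates the criterion of Equation (3.1) for the resulting partition.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def squared_error(clusters, centroids):
    """Squared-error criterion of Eq. (3.1) for a given partition."""
    return sum(
        squared_distance(p, m)
        for cluster, m in zip(clusters, centroids)
        for p in cluster
    )


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) for a list of points given as tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random data points as initial centroids
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(xs) / len(cluster) for xs in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, clusters
```

Because the initial centroids are drawn at random, such a function would typically be
called with several seeds and the partition with the smallest squared error kept, which
is the repeated-run cost discussed above.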


Fuzzy C-Means (FCM) [105]

FCM is based on the k-means concept to partition the data set into clusters.
The procedure is as follows: initialize the fuzzy partition matrix; compute the
cluster centroids and the objective value; update the membership values stored in
the matrix; if the change in the objective value between consecutive iterations is
less than the stopping threshold, stop. This process continues until a partition
matrix and clusters are formed.
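
The FCM procedure can be illustrated with the sketch below. It assumes the commonly
used update rules (centroids as means weighted by the memberships raised to a
fuzzifier m, memberships inversely related to the distance to each centroid); the
function name fcm and the parameters m, tol, max_iter and seed are hypothetical and
chosen only for this example.

```python
import random


def fcm(points, c, m=2.0, tol=1e-6, max_iter=200, seed=0):
    """Return (centers, u), where u[i][j] is the membership of point j in cluster i."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Initialize the fuzzy matrix at random and normalize each column to sum to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for j in range(n):
        column_sum = sum(u[i][j] for i in range(c))
        for i in range(c):
            u[i][j] /= column_sum
    centers, previous = [], float("inf")
    for _ in range(max_iter):
        # Centroids: means of the points weighted by membership ** m.
        centers = []
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]
            centers.append(tuple(
                sum(w[j] * points[j][d] for j in range(n)) / sum(w) for d in range(dim)
            ))
        # Squared distances and objective value.
        d2 = [[sum((a - b) ** 2 for a, b in zip(points[j], centers[i]))
               for j in range(n)] for i in range(c)]
        objective = sum(u[i][j] ** m * d2[i][j] for i in range(c) for j in range(n))
        if abs(previous - objective) < tol:  # stopping condition
            break
        previous = objective
        # Membership update (a point lying exactly on a center belongs fully to it).
        for j in range(n):
            on_center = [i for i in range(c) if d2[i][j] == 0.0]
            if on_center:
                for i in range(c):
                    u[i][j] = 1.0 if i == on_center[0] else 0.0
            else:
                inv = [d2[i][j] ** (-1.0 / (m - 1.0)) for i in range(c)]
                for i in range(c):
                    u[i][j] = inv[i] / sum(inv)
    return centers, u
```

Unlike k-means, every point keeps a degree of membership in every cluster; a hard
partition can be obtained afterwards by assigning each point to the cluster with its
largest membership.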
3.4.2 Hierarchical clustering

Hierarchical clustering is a technique which groups similar data by building a
hierarchy of clusters. This method is based on the connectivity approach to
clustering. It uses a distance matrix as the criterion to group the data and
constructs clusters step by step. In hierarchical clustering, there are two
approaches: agglomerative (bottom-up) and divisive (top-down). In the agglomerative
approach, each object starts as its own cluster and neighboring clusters are
successively merged according to the minimum, maximum or average distance between
them. The process continues until the desired grouping is formed. The divisive
approach treats the set of objects as a single cluster and divides it into smaller
clusters until the desired number of clusters is formed [103]. Among these
algorithms we can cite: Balanced Iterative Reducing and Clustering using
Hierarchies (BIRCH) [106], Clustering Using REpresentatives (CURE) [107] and
Robust Clustering algorithm for Categorical attributes (ROCK) [108].
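
As an illustration of the agglomerative approach, the naive sketch below starts with
one cluster per object and repeatedly merges the two closest clusters, where the
distance between clusters is the minimum, maximum or average pairwise distance. The
function name agglomerative and the linkage labels "single", "complete" and "average"
are assumptions for this example; it is not BIRCH, CURE or ROCK, which add summaries
or representatives to scale this idea to large data sets.

```python
import math
from itertools import combinations


def agglomerative(points, n_clusters, linkage="single"):
    """Merge the closest clusters until n_clusters remain (naive, for illustration)."""
    reducers = {
        "single": min,                         # minimum pairwise distance
        "complete": max,                       # maximum pairwise distance
        "average": lambda d: sum(d) / len(d),  # average pairwise distance
    }
    reduce_distances = reducers[linkage]

    def cluster_distance(a, b):
        return reduce_distances([math.dist(p, q) for p in a for q in b])

    clusters = [[p] for p in points]           # every object starts as its own cluster
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])        # merge cluster j into cluster i (i < j)
        del clusters[j]
    return clusters
```

The repeated search over all cluster pairs makes this version costly for large inputs,
which is one reason why the algorithms cited above rely on more compact cluster
representations.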


Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [106]

It is an agglomerative hierarchical clustering algorithm especially adapted
for very large databases [109]. The BIRCH process starts by building a CF-tree.
The data are then condensed by rebuilding a smaller CF-tree. Next, one of
the existing clustering algorithms is applied to the CF-tree leaves. Finally,
additional passes are performed over the data set to reassign the data points to
the closest centroids obtained in the previous step. This process continues until
k clusters are formed.


Clustering Using REpresentatives (CURE) [107]

It is an agglomerative hierarchical clustering method that strikes a balance
between the centroid-based and all-points approaches [109]. It selects well
dispersed points of each cluster and then shrinks them toward the cluster center by
a specified fraction. Adjacent clusters are successively merged until the number of
clusters is reduced to the desired number. The procedure is as follows: initially,
every point is a separate cluster, and each cluster is represented by the point in
it. The representative points of a cluster are generated by first selecting well
dispersed objects of the cluster and then shrinking them toward the cluster center
by a specified factor. At each step of the procedure, the two clusters with the
closest pair of representative points are selected and merged to form a single
cluster.


Robust Clustering algorithm for Categorical attributes (ROCK) [108]

It is an agglomerative hierarchical clustering algorithm based on the notion of
links [109]: clusters are formed using a link strategy, and links are merged from
the bottom up to form a cluster. The procedure is as follows: first, consider a set
of points in which each point is a cluster and calculate the links between each
pair of points. Build a heap and maintain it for each cluster. A quality measure
based on the criterion function is computed between pairs of clusters. Merge the
clusters that have the maximum value of the criterion function.


3.4.3 Grid-based algorithms 


A grid based algorithm partitions the dataset into a number of cells
to form a grid structure. The clusters are formed based on the grid structure.
To build clusters, the grid algorithm uses subspace and hierarchical clustering
techniques [103]. Among these algorithms we can cite: STatistical INformation
Grid based method (STING) [110], CLustering In QUEst (CLIQUE) [111] and
Merging of Adaptive Intervals Approach to Spatial Data Mining (MAFIA) [112].

STatistical INformation Grid based method (STING) [110]

It is similar to the BIRCH hierarchical algorithm [106] and builds clusters in
spatial databases. The process starts by storing the spatial data in rectangular
cells using a hierarchical grid structure. Then each cell is partitioned into four
child cells at the next level, with each child corresponding to a quadrant of the
parent cell. The probability of each cell being relevant or not is computed. If a
cell is relevant, the same calculations are applied to each of its child cells, one
by one. Finally, the regions of relevant cells are found to form a cluster.


CLustering In QUEst (CLIQUE) [111]

It is a subspace clustering algorithm for numerical attributes in which a
bottom-up approach is employed to build clusters. The algorithm can be described as
follows: consider a set of data points; in one pass, partition each dimension into
intervals of equal width to form the grid cells. Rectangular cells in a subspace
whose density exceeds τ are retained as dense units. The process continues
recursively, combining (q - 1)-dimensional units into q-dimensional units.
Connected dense units in a subspace are then joined to form clusters.
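
The bottom-up construction of dense units can be sketched for the first two levels as
follows. This is a simplified illustration rather than the full CLIQUE algorithm: it
stops at two-dimensional units and does not join connected units into clusters. The
function name dense_units and the interpretation of the density threshold (called tau
here) as a fraction of all points are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, width, tau):
    """Return the dense 1-D units and the dense 2-D units built from them.

    A 1-D unit is a pair (dimension, interval index); a 2-D unit is a pair of
    1-D units from different dimensions. A unit is dense when the fraction of
    all points falling into it exceeds tau.
    """
    n = len(points)
    n_dims = len(points[0])

    def unit(p, d):
        return (d, int(p[d] // width))  # equal-width interval in dimension d

    # Level 1: count points per interval in every single dimension.
    counts1 = Counter(unit(p, d) for p in points for d in range(n_dims))
    dense1 = {u for u, count in counts1.items() if count / n > tau}

    # Level 2: only combinations of dense 1-D units can be dense, so count those alone.
    counts2 = Counter()
    for p in points:
        own = [unit(p, d) for d in range(n_dims) if unit(p, d) in dense1]
        counts2.update(combinations(own, 2))
    dense2 = {u for u, count in counts2.items() if count / n > tau}
    return dense1, dense2
```

Restricting level 2 to combinations of dense one-dimensional units is what keeps the
search tractable: a cell cannot be dense in a subspace unless its projections onto the
lower-dimensional subspaces are dense as well.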


Merging of Adaptive Intervals Approach to Spatial Data Mining (MAFIA) [112]

It is a variant of the CLIQUE algorithm [111]. Unlike the CLIQUE algorithm,
it does not use a grid structure of fixed size cells with an equal number of
bins in each dimension; instead, it constructs an adaptive grid to improve the
quality of the clustering. The algorithm can be described as follows: in a single
pass, an adaptive grid structure is built by considering the set of all points. The
histogram is calculated by reading blocks of data into memory using bins. The bins
are grouped based on a dominance factor α. The bins that are α times denser than
the mean are chosen as p-candidate dense units (p-CDUs). The process continues
recursively to form new p-CDUs and merge adjacent CDUs into clusters.

3.4.4 Model-based algorithms


In a data collection, data points are connected with each other according to
different strategies such as statistical methods, conceptual methods and robust
clustering methods. Two approaches to model-based algorithms are available: the
neural approach and the statistical approach [103]. Among these, we present the
Self Organized Map algorithm (SOM) [113] as a neural approach and the model based
clustering algorithm COBWEB [114] as a statistical approach.


Self Organized Map algorithm (SOM) [113]

Neural networks consider each cluster as a neuron, and the input data are also
considered as neurons. Each neuron connection is assigned a weight that is randomly
initialized before the weights are learned in an adaptive manner [115]. SOM [113]
is one of the most widely used algorithms. The SOM is considered as a two-layer
network. Each neuron is represented by an n-dimensional weight vector
m = (m_1, ..., m_n), where n is equal to the dimension of the input vectors. The
neurons of the SOM are themselves cluster centers; however, to ease interpretation,
the map units can be combined to form bigger clusters. The SOM is trained
iteratively. In each training step, one sample vector x from the input data set is
chosen randomly. The distance between it and all the weight vectors of the SOM is
calculated using a distance measure. After finding the best matching unit, the
weight vectors of the SOM are updated so that the best matching unit is moved
closer to the input vector in the input space [116].
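
A single SOM training step can be sketched as below for a one-dimensional map. The
function name som_step, the learning rate lr, the radius parameter and the simple
neighbourhood rule (full learning rate for the best matching unit, half of it for its
immediate neighbours on the map) are assumptions made for this example; practical
SOMs usually use a two-dimensional map and let both the rate and the radius decay
over time.

```python
def som_step(weights, x, lr=0.5, radius=1):
    """One training step on a one-dimensional map; returns the index of the BMU.

    weights is a list of weight vectors (one per neuron) and is updated in place.
    """
    def squared_distance(w):
        return sum((wk - xk) ** 2 for wk, xk in zip(w, x))

    # Best matching unit: the neuron whose weight vector is closest to x.
    bmu = min(range(len(weights)), key=lambda i: squared_distance(weights[i]))
    for i, w in enumerate(weights):
        if abs(i - bmu) <= radius:
            # Assumed neighbourhood function: full rate for the BMU, half for its neighbours.
            h = lr if i == bmu else lr / 2
            weights[i] = [wk + h * (xk - wk) for wk, xk in zip(w, x)]
    return bmu
```

Training would consist of calling this step repeatedly with sample vectors drawn at
random from the input data set, as described above.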


Model based clustering algorithm (COBWEB) [114] The COBWEB algorithm
gives a clustering dendrogram called a classification tree, which characterises
each cluster by a probability description [117]. It is known as an incremental
learner since, when a data object is entered, the nodes of the tree are restructured.
In some situations, this can change the entire structure of the tree considerably.
COBWEB uses the category utility (CU) measure [118] as the criterion function
for determining partitions in the hierarchy. The expected value of CU used in
COBWEB is defined in terms of $P(A_i = V_{ij} \mid C_k)^2$. If the given data object is
in a cluster $C_k$, this value gives the probability that attribute $A_i$ takes the value
$V_{ij}$, which signifies the probabilistic match of the data object to the cluster. The
role of category utility is to make a trade-off between maximizing intra-class
similarity and inter-class dissimilarity. CU was used as the basis for incremental
clustering [118].
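To make the role of the squared conditional probabilities concrete, the sketch below computes category utility for a small partition of nominal data. It assumes the commonly used form of the measure, CU = (1/K) * sum_k P(C_k) * [sum_ij P(A_i = V_ij | C_k)^2 - sum_ij P(A_i = V_ij)^2], which is not spelled out in the text above; the function names and the toy data are illustrative only.

```python
from collections import Counter

def sum_sq_probs(objects):
    """Sum over attributes i and values j of P(A_i = V_ij)^2 within `objects`."""
    total = 0.0
    n_attrs = len(objects[0])
    for i in range(n_attrs):
        counts = Counter(obj[i] for obj in objects)
        total += sum((c / len(objects)) ** 2 for c in counts.values())
    return total

def category_utility(clusters):
    """Category utility of a partition given as a list of clusters (lists of tuples)."""
    all_objects = [obj for cluster in clusters for obj in cluster]
    baseline = sum_sq_probs(all_objects)          # term without cluster knowledge
    cu = 0.0
    for cluster in clusters:
        p_cluster = len(cluster) / len(all_objects)
        cu += p_cluster * (sum_sq_probs(cluster) - baseline)
    return cu / len(clusters)

# Two candidate partitions of the same four objects (colour, shape).
pure = [[("red", "round"), ("red", "round")], [("blue", "square"), ("blue", "square")]]
mixed = [[("red", "round"), ("blue", "square")], [("red", "round"), ("blue", "square")]]
```

Under this assumed formula, the partition `pure`, whose clusters are internally homogeneous, scores higher than `mixed`, whose clusters carry no information about the attribute values.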


3.4.5 Density-based algorithms 


Density-based clustering algorithms try to find clusters based on the density of
data points in a region [116]. The main idea of density-based clustering is that,
for each instance of a cluster, the neighborhood of a given radius eps has to
contain at least a minimum number of instances MinPts [116]. The data objects are
categorized into core points, border points and noise points. All core points are
interconnected based on the densities to form a cluster; clusters are thus found
by growing regions of high density. These are one-scan algorithms. They are
capable of finding arbitrarily shaped clusters and of handling noise. Two of the
most well-known density-based clustering algorithms are the Density-Based Spatial
Clustering of Applications with Noise (DBSCAN) [119] and the Ordering Points
To Identify the Clustering Structure (OPTICS).


Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
[119] It is a connectivity-based algorithm which distinguishes three kinds of
points, namely core, border and noise points (Figure 3.2). It takes a distance
threshold r (named eps in DBSCAN) and a density threshold k (named MinPts in
DBSCAN).

1. The density of a point $x_i$ is defined as the number of points $k_i$ that are
within a radius r around $x_i$.

2. If $k_i \geq k$, the corresponding point $x_i$ is considered a core point.

3. Two points are considered directly connected if they have a distance of less
than r.

4. Two points are density connected if they are connected to core points and
these core points are in turn density connected.

5. A border point has fewer than MinPts points within eps, but lies in the
neighborhood of a core point.

6. A noise point is defined as any point that is neither a core point nor a
border point.

These definitions make it possible to define the transitive hull of
density-connected points, forming density-based clusters.


[Figure 3.2 illustration: example with MinPts = 3 and eps = 2 units, showing a core point and a noise point.]

Figure 3.2: DBSCAN algorithm based on eps and MinPts [120].

The algorithm is described by:

1. Regard the set of points as the nodes of a graph.

2. Create an edge from each point c to the other points in the neighborhood of c.

3. If the set of nodes N contains no center points, then terminate.

4. Select a node X that can be reached from c.

5. Repeat the procedure until all center points form a cluster.
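The core, border and noise definitions and the cluster-growing procedure can be sketched as below. This is a simplified, quadratic-time illustration rather than the implementation of [119]; it assumes Euclidean distance, counts a point as its own neighbour, uses a non-strict comparison with eps, and the names `dbscan`, `NOISE` and `min_pts` are chosen for the example.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbours(i):
        # Indices of all points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later be relabelled as a border point
            continue
        labels[i] = cluster_id           # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:   # j is also a core point: keep growing
                queue.extend(j_neighbours)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=2.0, min_pts=3)
```

On this toy data the two dense groups form two clusters and the isolated point (50, 50) is labelled as noise, without the number of clusters being given in advance.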


Ordering Points To Identify the Clustering Structure (OPTICS) [121]
The difficulty of finding density-based clusters with widely differing densities has
also motivated hierarchical procedures for computing clusters at different density
levels in a single pass [122]. Because the connected components of different
density levels are either disjoint or the cluster of higher density is entirely
contained within the lower density cluster, the result of such hierarchical
algorithms can be represented as a tree. Practical approaches for density-based
hierarchical clustering include OPTICS [123]. OPTICS is an extension of the
DBSCAN algorithm and is based on the same parameters as the DBSCAN
algorithm. The algorithm is as follows:


1. Select a point from the set of points; it is a center point if at least MinPts
points are within the base distance.

2. For each point c, create an edge from c to another point with a center distance
of c.

3. Select a set of nodes that contain center points as a cluster that extends from
c.

3.5 Comparison Between Clustering Techniques


The comparison of various clustering algorithms is given in Table 3.1. Among
partitional clustering methods, k-means clustering dominates and is still the most
popular clustering method. In k-means, even if an object is quite far away from the
cluster centroid, it is still forced into a cluster and thus distorts the cluster
shapes [124], which produces clusters of lower quality. The final result of the
clustering is heavily dependent on the initial centroids, so the whole process is
carried out several times with different initial parameters. For a data set of
fixed size this might not be a problem. However, in the context of big IoT data,
this characteristic of the algorithm leads to heavy computational overload [10].
The k-means algorithms have problems such as the need to define the number of
clusters initially, susceptibility to local optima, sensitivity to outliers, memory
space requirements and an unknown number of iteration steps required to cluster.
Fuzzy C-means clustering is well suited for handling issues related to the
understandability of patterns, incomplete/noisy data, mixed media information and
human interaction, and it can provide approximate solutions faster [125].


The main motivations of BIRCH lie in two aspects: the ability to deal with large
data sets and the robustness to outliers [126]. BIRCH can also achieve a
computational complexity of $O(N)$, where N is the number of objects. ROCK not
only generates better quality clusters than traditional algorithms, it also exhibits
a good scalability property [107]. BIRCH and CURE both handle outliers well, but
the clustering quality of CURE is better than that of BIRCH. Conversely, in terms
of time complexity, BIRCH is better than CURE, as it attains a computational
complexity of $O(N)$ compared to $O(N^2 \log N)$ for CURE. The CLIQUE algorithm
combines the advantages of density and grid methods. It divides data not only
based on the grid, but also takes density into account. It partitions data into
dense and sparse sets and focuses on dense grid cell data. However, the CLIQUE
algorithm divides each dimension equally according to the user setting, which may
lead to a cluster being divided into several artificial clusters. In addition, the
number of connections grows exponentially and the computational complexity
becomes very high for high-dimensional data sets [86]. The performance results
show that MAFIA is 40 to 50 times faster than CLIQUE due to the use of adaptive
grids. MAFIA introduces parallelism to obtain a highly scalable clustering
algorithm for large data sets [116]. DBSCAN (density-based spatial clustering of
applications with noise) discovers clusters of arbitrary shapes and is efficient
for large spatial databases [125]. OPTICS finds clusters at different density
levels. It ensures good quality clustering by maintaining the order in which the
data objects are processed, i.e., high density clusters are given priority over
lower density clusters [116].

Table 3.1: Comparison of various clustering algorithms

In Table 3.1, N is the number of objects, m is the number of initial sub-clusters
produced by the graph partitioning algorithm, $m_m$ is the maximum number of
neighbours for a point and $m_a$ is the average number of neighbours for a point.


In the partitional algorithms, the common criterion is relatively scalable and
simple. It consists in finding the Euclidean distance between the points and the
centers of the available clusters and assigning each point to the cluster with the
minimum distance [125]. However, these algorithms provide poor cluster
descriptors. They rely on the user to specify the number of clusters in advance.
They are highly sensitive to the initialization phase, noise and outliers. They are
unable to deal with non-convex clusters of varying size and density [125]. In
addition, they give bad results when data points overlap [130].
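The assignment criterion of the partitional algorithms can be illustrated with a short sketch. The function name `assign_to_nearest` and the toy points and centers are assumptions made for the example; only the assignment step is shown, not a full k-means loop.

```python
import math

def assign_to_nearest(points, centers):
    """Return, for each point, the index of the closest cluster center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (30.0, 30.0)]
centers = [(1.0, 1.5), (8.5, 9.0)]
assignment = assign_to_nearest(points, centers)
```

The last point, (30, 30), is far from both centers but is still assigned to the second cluster, which is the sensitivity to outliers noted for these algorithms.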


The hierarchical method is based on the distance between objects and clusters. The
idea of hierarchical methods is that objects are more related to nearby objects
than to objects farther away. The major problems which commonly occur in
hierarchical clustering algorithms are [128]:

• No objective function is directly minimized.
• Sensitivity to noise and outliers.
• Difficulty in handling different sized clusters and convex shapes.
• Difficulty in breaking large clusters.


Grid-based clustering algorithms such as STING and CLIQUE have the ability
to decompose the data set into various levels of detail. The evolutionary
approaches for clustering start with a random population of candidate solutions
with some fitness function, which is then optimized [125]. Additional advantages
of grid-based algorithms are their fast processing time [131], the absence of
distance computations and the ease of determining which clusters are neighbouring.
In the model-based method, a model is hypothesized for each of the clusters and
the method tries to find the best fit of the data to that model [116]. Model-based
clustering algorithms are far less common than partitional and grid-based
algorithms. Unfortunately, no implementation of model-based algorithms is readily
available, which limits their usefulness in practice. In addition, they are often
more computationally complex than comparable algorithms from the other
categories [132].


In the density-based clustering methods, the data space is composed of dense
regions separated by regions of lower object density, and a cluster is defined as a
maximal set of density-connected points [125]. Density-based clustering algorithms
are used to form clusters of high quality with acceptable time complexity. They
also provide a strategy to filter noise from real data. They are robust in finding
clusters with different densities. They are suitable for high-dimensional data and
big data.


3.6 Other Big IoT Data Analytics Methods 


In addition to clustering methods, several solutions are currently offered for the
analysis of big data and big IoT data (Figure 3.3). These solutions need the
same or higher processing speed than traditional data analysis, at minimum
cost, for high volume, high velocity and high variety data [133]. These solutions
are continuously developed to adapt them to the new developments in big IoT
data. The exploration of IoT data has an important role in analytics and, like
the clustering methods, the majority of the techniques are developed using data
mining algorithms, based on a specific scenario, such as the prediction method,
association rule methods and classification methods.




Figure 3.3: Big data analytics methods [85]. 


3.6.1 Prediction method 


Predictive analytics employs historical data, referred to as training data, to
determine results such as trends or attitudes in the data. In big data analytics,
processing demands change depending on the nature and volume of the data. Rapid
data access and exploration methods for structured and unstructured data are
important concerns related to the analysis of big data. In addition, data
representation is an important requirement in big data analysis [85].


3.6.2 Association rule method 


Association rule mining is centered on the identification and generation of rules
based on the occurrence frequency for numeric and non-numeric data [134]. The
data is processed in two ways. The first way, sequential data processing, uses a
priori algorithms, such as MSPS (Maximal Sequential Patterns by using multiple
Samples) [135] and LAPINSPAM (LAst Position INduction Sequential PAttern
Mining) [136], to identify interaction associations. The second way of processing
the data according to the association rule is temporal sequence analysis, which
uses algorithms to analyze patterns of events in continuous data.


3.6.3 Classification method 


In supervised classification, the class (label) of an object is predefined. The major
objective of the classification approach is to develop a tool or algorithm that can
be used to predict the class of an unknown object, which is unlabeled. This
tool or algorithm is named a classifier. The objects in the classification process
are more commonly represented by instances or patterns. A pattern consists of
a number of features (also known as attributes). The classification precision of
a classifier is evaluated by the number of test patterns it has classified correctly
[125]. One classification method, SVM (Support Vector Machines), is based
on the theory of statistical learning to recognize patterns in the data and generate
groups. In the same way, K Nearest Neighbor (kNN) is commonly used as a
mechanism for retrieving patterns from large datasets, so that the retrieved
objects are similar to the predefined category [137].


However, finding unknown or hidden patterns is more difficult for IoT big data.
Also, extracting valuable information from big data sets to improve decision
making is a very critical task. In addition, huge amounts of data are usually
produced in IoT applications; however, these data lack labels, which makes
these types of methods infeasible in an IoT environment [10].

3.7 Conclusion


Compared with data analytics methods, namely the prediction, association rule
and classification methods, clustering methods present interesting characteristics,
especially the DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure)
algorithms, which have surpassed the k-means algorithm in terms of the
homogeneity of clusters. The use of the DBSCAN or OPTICS algorithms as a
pre-processing step before storing IoT data will result in the creation of clusters
of high homogeneity in terms of data type and dimension. These homogeneous
clusters will reduce data overlapping, which represents a serious challenge often
encountered during big IoT data indexing in both metric and multidimensional
spaces. Another advantage of these clustering methods is the possibility of
introducing parallelism in both the big IoT data indexing process and the
similarity query search.

4.1 Introduction


Indexing is a widely used method to store big IoT data and to provide fast answers
when searching similarity queries. Indexing methods were proposed as a solution to
the challenge of processing and storing big IoT data so that the similarity query
search proceeds efficiently. This challenge comes from the production, by various
connected devices in IoT architectures, of different types of data in large volumes
at very high speeds. In this chapter, we will present most indexing methods in
both multidimensional and metric spaces (Figure 4.1). For both spaces, indexing
methods will be grouped into two groups: centralized methods and distributed
methods. Finally, a comparative analysis of these methods in terms of advantages
and disadvantages will be presented. In this chapter, particular attention is given
to the indexing methods in metric space because of their link with this work.


4.2 Multidimensional Space Indexing Methods 


Indexing techniques in multidimensional spaces can be categorized into two main
types depending on the structure: hashing methods and tree methods. The hashing
methods are grouped into Locality Sensitive Hashing methods (LSH) and
Learning to Hash methods (L2H) (Figure 4.2).


[Figure 4.1 diagram: Big IoT Data Indexing; indexing methods are divided by space (including metric space) into hashing methods and tree methods, each split into centralized and distributed methods.]

Figure 4.1: Global taxonomy of IoT data indexing methods.

Figure 4.2: Hashing methods in multidimensional space.

4.2.1 Hashing methods

4.2.1.1 Locality Sensitive Hashing methods (LSH)


Hashing is an approach that consists in transforming the data object into a
low-dimensional representation or, equivalently, into a short code of bits.
Locality Sensitive Hashing methods (LSH) and their variants are widely used
methods [138][139][140][141]. The LSH scheme was first introduced by Indyk et al.
[138] to be applied in the binary Hamming space $\{0, 1\}^d$ and later extended to
be used in the Euclidean space $\mathbb{R}^d$ by Datar et al. [139]. LSH maps the
points in the data set to buckets in hash tables by using a set of predefined hash
functions that are designed to be locality sensitive, so that close points are hashed
to the same bucket with high probability [142]. In this work, the presented LSH
methods are grouped into centralized methods and distributed methods.
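The bucket mapping can be illustrated with a small sketch of a Euclidean LSH table. It assumes the projection-based hash family h(v) = floor((a . v + b) / w) with a Gaussian vector a and a uniform offset b, in the spirit of the Euclidean extension [139]; the bucket width `w`, the number of hash functions and all function names are illustrative choices, not values taken from the cited works.

```python
import math
import random

def make_hash(dim, w, rng):
    """Build one hash h(v) = floor((a . v + b) / w) with Gaussian a and uniform b."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def build_table(points, hashes):
    """Map each bucket key (tuple of hash values) to the indices of its points."""
    table = {}
    for idx, p in enumerate(points):
        key = tuple(h(p) for h in hashes)
        table.setdefault(key, []).append(idx)
    return table

rng = random.Random(42)
hashes = [make_hash(dim=2, w=4.0, rng=rng) for _ in range(3)]
points = [(0.0, 0.0), (0.1, 0.1), (50.0, 50.0)]
table = build_table(points, hashes)

# A query is hashed with the same functions; only its bucket is inspected.
query = (0.05, 0.05)
candidates = table.get(tuple(h(query) for h in hashes), [])
```

A query that falls in an empty bucket returns no candidates, which is why practical schemes maintain several such tables; a single table is used here only to keep the sketch short.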


4.2.1.1.a Centralized methods 


• Collision Counting LSH (C2LSH) is a scheme which can ensure the quality
of the query by selecting the size of the LSH function appropriately and the
collision threshold dynamically [143].

• Query-Aware Locality Sensitive Hashing (QALSH) [144] uses two techniques
to improve accuracy. The first technique introduces query-aware hash functions
by creating a B+-tree on each random projection. The second technique performs
incremental range queries until the top-k candidates are found.

• PDA-LSH (Projection Distance Aware LSH) is a locality sensitive hashing
method proposed to speed up the approximate c-ANN search with a low cost of
index maintenance [13]. It is based on the use of the LMS-tree [145], which
indexes the pairs $(o_i, id)$ made of the ith projection of an object and its
identifier.

• PM-LSH [142]: in this method, the points are transformed into a low-dimensional
space, called the projected space. The coordinates of a point in the projected
space are the point's hash values. A PM-tree is used to organise these points.
However, this index presents several difficulties: the original distance between
the points must be estimated after the results of a query are obtained; the hash
tables consume storage space; and the query search requires computing the hash
function of the query in addition to computing probabilities on the candidate
points.


The above-mentioned methods use a number of hash tables that are necessary to
ensure the quality of the search. However, because of the limitations of storage
space and server processing ability, centralized indexing schemes are not feasible.

LSH enables sublinear search time in high dimensions, but usually requires long
hash codes. To generate compact codes, hash functions should be adapted to the
data distribution, because indexing schemes become impractical for large data
objects [146].


4.2.1.1.b Distributed methods 


• Near bucket-LSH: this method, proposed by Kraus et al. [147], integrates
LSH and cosine similarity metrics in a P2P Content Addressable Network
(CAN) architecture to enhance network efficiency when searching near buckets.

• LSH-based Fusion Features for Image Retrieval (LFFIR): Liao et al. [148]
introduced a distributed image retrieval framework for similar content search,
which can effectively integrate image retrieval based on fused features in
the Chord P2P network into a cloud data center.

• Decentralized Search for Large and Mobile wireless networks (DSLM) is a
system proposed for large mobile wireless networks. It divides the entire
network into smaller regions, allowing nodes to join, leave, distribute
metadata or make requests. Next, the LSH function is used to distribute
groups of similar documents to the same region. Then a region-based geographic
routing method is applied to route messages between nodes [149]. However, the
LFFIR and DSLM indexing schemes insufficiently account for the load balancing
problem, which is one of the key issues for the overall performance of a
distributed system [150].

• A hashing method to obtain a balanced distributed P2P network is proposed
in [150]. The concept of a virtual node is used to adapt to dynamic changes
in data load and network environment.


Nevertheless, the LSH family needs a longer code length to ensure search
performance, which leads to a higher storage cost, thus limiting the scalability of
the overall algorithm [151]. In addition, as LSH was originally developed to search
for objects within a fixed radius, it needs to construct indexes for different radii
to guarantee the quality of the results. Thus, in this case, hundreds or thousands
of hash tables are constructed, which results in high space and search costs [142].


4.2.1.2 Learning to Hash methods (L2H) 


To minimize the search cost in LSH methods, Learning to Hash methods (L2H)
have been proposed for their ability to learn similarity-preserving hash functions
adapted to a given data set. These methods are classified into four categories
according to the degree of supervision, namely: unsupervised hashing, supervised
hashing, semi-supervised hashing and deep hashing [152]. According to the
literature [153] [154], L2H methods adopt Hamming ranking (HR) as a query
technique that probes the buckets in ascending order of their Hamming distance to
the query.


4.2.1.2.a Centralized methods 


• Dynamic Multi-view Hashing (DMVH) [155] is capable of adaptively increasing
hash codes according to dynamic changes in the image. This hashing technique
also uses multi-view features to achieve more efficient hashing performance.

• Robust Discrete Spectral Hashing (RDSH) [151] is a hashing approach to
facilitate large-scale semantic indexing of image data. It aims at learning
a set of discrete binary codes and robust hash functions within a unified model.
This approach is not adequate for large and dynamic databases [152].


4.2.1.2.b Distributed methods 


• Distributed Indexing by Sparse-Hashing (DISH) is a distributed kNN index
for cloud-based systems, based on sparse hashes [156]. It helps to overcome
challenges related to large-scale index distribution by associating vectors
with several index nodes based on their orthogonal similarities, and to search
large-scale distributed image collections. DISH distributes documents and
queries in a balanced and redundant way between nodes [156].

• Supervised Distributed hashing (SupDisH) is an efficient method that learns
discriminative hash functions by taking advantage of the semantic information
of the labels in a distributed manner [157]. The distributed hashing problem
is discussed in the context of classification, where it is expected that the
learned binary codes are sufficiently distinct for semantic retrieval [157].


LSH methods operate with predefined hash functions regardless of the underlying
dataset, whereas L2H learns custom hash functions based on the dataset. While
an additional training step is necessary, some studies have shown experimentally
that L2H outperforms LSH in terms of query efficiency [158], [159], [160].
However, the Hamming distance is a coarse indicator of the similarity between the
query and the elements of a bucket, as it is discrete and has a limited number of
values. Consequently, Hamming ranking may not define a good order for buckets
having the same Hamming distance from the query. As a consequence, HR generally
probes a large number of unfavorable buckets, leading to low efficiency. A
solution is to employ a long code so that the Hamming distance can classify
buckets into a larger number of categories. However, the long code has challenges
such as time-consuming sorting, high storage demand and low scalability, in
particular for large-scale datasets [161].
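Hamming ranking and its tie problem can be seen on a toy example. The sketch below assumes that bucket codes are stored as Python integers and that ties are left in insertion order; the names `hamming` and `hamming_ranking` and the 4-bit codes are illustrative only.

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary codes stored as ints."""
    return bin(a ^ b).count("1")

def hamming_ranking(buckets, query_code):
    """Return bucket codes sorted by ascending Hamming distance to the query code."""
    return sorted(buckets, key=lambda code: hamming(code, query_code))

# Bucket code -> identifiers of the objects stored in that bucket.
buckets = {0b0000: ["a"], 0b0011: ["b"], 0b0001: ["c"], 0b1111: ["d"], 0b0100: ["e"]}
order = hamming_ranking(buckets, query_code=0b0001)
```

Buckets `0b0000` and `0b0011` are both at distance 1 from the query, so the ranking cannot tell which of them should be probed first; with only 4 bits there are just five possible distance values, which is the coarseness discussed above.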


4.2.2 Tree methods 
4.2.2.1 Centralized methods 


The centralized indexing techniques in the multidimensional space can be
classified into two major approaches: space partitioning, which uses space cells
to index the data, and data partitioning, which uses cells of similar objects
(approximation function) to index the data (Figure 4.3).


4.2.2.1.a Space partitioning methods 


Several previous studies have focused on indexing through tree structures that
rely on the successive division of the multidimensional space. Among these
methods of space partitioning, we can cite the following:


Figure 4.3: Centralized tree methods in multidimensional space.

K-dimensional tree (Kd-tree) is a static method based on the partitioning
of a space of k dimensions [162]. It is based on the division of a dataset into two
equal subspaces (Left, Right) by the median m of the dataset. This process of
dividing the spaces is repeated recursively. The data is structured in the form
of a binary tree. The principal disadvantage of the Kd-tree is that it is
unbalanced, since the dividing hyperplane does not partition the space in the
best place. This creates overlaps between neighboring regions, which causes the
cost of I/O operations to be higher [163]. Additionally, when searching for a kNN
query in high-dimensional spaces, most of the points in the tree will be traversed
and the efficiency is not better than with an exhaustive search [164], [165]. On the
other hand, because the Kd-tree partitions the space using hyperplanes, when the
query point is close to the boundary between two neighboring regions it is
necessary to visit both regions, which affects the response time negatively
[162], [166].
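The recursive median split of the Kd-tree can be sketched as follows. Cycling through the coordinates by depth is the classic construction and is an assumption here, as are the dictionary-based node layout and the name `build_kdtree`.

```python
def build_kdtree(points, depth=0):
    """Recursively split `points` at the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])             # cycle through the k dimensions
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2                   # median position along this axis
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kdtree(ordered[:mid], depth + 1),
        "right": build_kdtree(ordered[mid + 1:], depth + 1),
    }

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
```

Each node stores the median point of its subset, with smaller coordinates along the split axis on the left and the remaining points on the right, so the tree is built once from a static data set.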


KdB-tree [166] is a combination of the Kd-tree [162] and the B-tree [167]. It is
a dynamic and balanced tree structure. It was proposed to improve the I/O
performance of the Kd-tree [162]. The KdB-tree cannot ensure minimal storage
consumption. It only considers point-in-time data and offers insufficient search
performance.


Quad-tree [168] partitions the two-dimensional space recursively into quadrants,
which form the partitions of the index space. Like the Kd-tree, it is not
balanced, as it does not select the optimal division of the space. Moreover, the
Quad-tree ignores the distribution of the data in the space partitioning process
[169].


4.2.2.1.b Data partitioning methods 


This approach is based on partitioning the data by means of packages grouping
the data, also called "enclosing forms". These methods can be classified into two
classes: the first is based on the grouping of objects into rectangles of minimum
delimitation (hyper-cubes); the second is based on the grouping of objects into
regions of minimal delimitation (hyper-planes).


Hyper-cubes regrouping methods 


1. R-tree, first proposed by Guttman [12]. It is considered as one of the first
methods that index data in multidimensional spaces in the form of a balanced
hierarchical partitioning into sets of rectangles called Minimum Bounding
Rectangles (MBRs). It is a spatial access method used to index geographic
coordinates. A Minimum Bounding Rectangle (MBR) is determined by a pair
of vectors where the components of the first vector are all less than or
equal to those of the second vector. This pair of coordinates defines the
smallest volume that encloses a given set of points and/or geometric forms.
The R-tree is an efficient structure for range queries [170], dynamic [12] and
balanced [171]. However, it suffers from the problem of overlap between the
rectangles, which makes it difficult to find the objects in these rectangles.
In higher dimensions, the time and complexity of the computations increase
[172], and its performance degrades rapidly [173]. Many methods have been
proposed to improve it. The R+-tree [174] eliminates overlapping rectangles
by dividing them until all overlaps are eliminated; however, this increases
the tree height. The R*-tree [175] minimizes the overlap of rectangles by
reinserting a few child nodes before dividing a node. It improves partitioning
by aggressively reinserting data objects, leading to a more efficient search
performance [176].


2. eXtended node tree (X-tree) [177] is another multidimensional index, similar
to the R-tree, which aims to limit the problem of overlapping forms. It adopts
a completely different strategy for partitioning nodes, which are extended
with variable sizes and called extended nodes. It enhances the R*-tree by
introducing overlap-minimizing splits for the objects that caused the overlap,
but it eventually degenerates to a sequential scan [173].


3. Sphere and Rectangle (SR-tree) [172] is based on the grouping of objects
into regions, where each region is the intersection of a hyper-rectangle and a
hyper-sphere. The idea is that intersections of these shapes give small areas,
which avoids overlapping. However, building these shapes and finding the
intersection areas is very complex.


4. Subspace based High-dimensional Indexing (SUSHI-tree) [173] is proposed
to index high-dimensional objects. The space is divided into subspaces by
clustering. The internal nodes are clusters defined by a list of upper and lower
bound values for the relevant dimensions of the cluster and a pointer to the
corresponding child node representing the objects. However, in this index
the division of the space into subspaces does not guarantee that each cluster
contains objects of the same type.


5. Time Parameterized R-tree (TPR-tree) [178] is a variant of the R*-tree for
processing movement data. It supports queries for present and future positions
of moving objects. Moving objects are enclosed in a bounding box that does not
shrink. The position and period of an object are represented by a function.

6. TPR*-tree [179] is an improvement of the TPR-tree. It adds insertion and
deletion to the TPR-tree.

7. Decomposition Tree (D-tree) [180] is a virtual tree used for the indexing of
multidimensional motion. It has no internal nodes, which are replaced by an
encoding method based on integer bit-shifting operations.


8. Bubble Buckets tree (BB-tree) [176] is based on the combination of the
Kd-tree structure [162] and the X-tree [177]. It recursively partitions the data
space into k partitions; its leaf nodes store objects in elastic buckets named
Bubble Buckets (BB). Each BB contains m-dimensional objects. Similarity search
queries are not considered in the structure.


Hyper-planes regrouping methods 


1. B-tree [167] is a balanced search tree in which each node of order d contains
at most 2d keys with 2d + 1 pointers. The search in a B-tree depends on the
branch from the node to the query key. When the query is less than the saved
key, the left branch is chosen; if it is greater, the right branch is chosen. Its
drawbacks are that it needs linear space for storage and logarithmic time for
the basic operations of insert and find. In addition, the B-tree only works well
for one-dimensional data [181].


2. B+-tree [182] is an m-ary tree that has a variable and usually large number
of children per node. A B+-tree consists of a root, internal nodes and leaves.
The root can be either a leaf or a node with two or more children. Many
indexing schemes are based on the B+-tree.


3. STCB-tree [183] indexes the trajectories of motion in the past and the
present, and anticipates the future.


4. UB-tree (Universal B-tree) [181] is an improvement of the B-tree. It organizes
the objects in an n-dimensional space (called universe) so that they can be
stored, managed, retrieved from and deleted from peripheral storage very
efficiently.
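

The branch selection described for the B-tree can be sketched as follows. This is a minimal illustration rather than the structure of [167]: the `Node` class, the hand-built three-node tree and the `search` function are hypothetical names introduced here, and insertion and node splitting are omitted.

```python
from bisect import bisect_left


class Node:
    """A B-tree node: sorted keys and, for internal nodes, len(keys) + 1 children."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def search(node, key):
    """Descend from the root, choosing one branch per node."""
    while True:
        # Index of the first stored key that is >= the query key.
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:  # reached a leaf without finding the key
            return False
        # Smaller keys lie in the branch to the left of keys[i],
        # larger keys in the branches to its right.
        node = node.children[i]


# Order d = 1: at most 2d = 2 keys and 2d + 1 = 3 pointers per node.
root = Node([10, 20], [Node([3, 7]), Node([12, 17]), Node([25, 30])])
print(search(root, 17), search(root, 18))
```

Each step inspects a single node, which is why the number of visited nodes grows with the height of the tree and not with the number of stored keys.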


4.2.2.2 Distributed methods 


The distributed methods, cited in what follows, are grouped in Figure 4.4.


Figure 4.4: Distributed tree methods in multidimensional space.


General indexing framework is the first index that was presented for indexing
data in the cloud system based on an overlay network [184]. This index has
two layers: the servers are organized in the overlay network, and every server
constructs its own local index to accelerate data retrieval. The global
index is constructed over the local indexes by choosing part of each local index
and publishing it to the network. The global index is used to provide an overview
of the local indexes. While this index scheme is scalable and flexible, the
Peer-to-Peer structure is not well suited for cloud systems [185].
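

The two-layer idea can be illustrated with a minimal sketch. It assumes one-dimensional integer keys, a sorted list as each server's local index, and a global index that publishes only the key range of each local index. The names `Server`, `build_global_index` and `lookup` are hypothetical and are not taken from [184]; the overlay network itself is not modelled.

```python
from bisect import bisect_left


class Server:
    """A storage server holding a local index (here: a sorted list of keys)."""

    def __init__(self, name, keys):
        self.name = name
        self.local_index = sorted(keys)

    def summary(self):
        # Part of the local index published to the global layer:
        # only the covered key range, not the keys themselves.
        return (self.local_index[0], self.local_index[-1], self)

    def contains(self, key):
        i = bisect_left(self.local_index, key)
        return i < len(self.local_index) and self.local_index[i] == key


def build_global_index(servers):
    """Global index: an overview (key range -> server) of the local indexes."""
    return [s.summary() for s in servers if s.local_index]


def lookup(global_index, key):
    """Route the query with the global index, then search locally."""
    hits = []
    for low, high, server in global_index:
        if low <= key <= high and server.contains(key):
            hits.append(server.name)
    return hits


servers = [
    Server("s1", [3, 8, 15]),
    Server("s2", [20, 27, 31]),
    Server("s3", [10, 22]),
]
global_index = build_global_index(servers)
print(lookup(global_index, 22))
```

Only servers whose published range covers the key are contacted; the others are skipped without touching their local indexes.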


A-tree, first proposed by Papadopoulos et al. [186], is a method suited to
cloud computing environments. It is a distributed and scalable indexing scheme
for multidimensional data, capable of handling both point and range queries. It
is based on the combination of R-tree [12] and Bloom filters [187]. It is only
used for multidimensional data.


Efficient Multi-dimensional Index with Node Cube (EMINC) [185] is a
multi-dimensional two-layer index. It provides fast query processing and efficient
index maintenance. It is an approach for indexing large IoT datasets: a hierarchical
approach to building a multidimensional index for a cloud system. It combines
R-tree [12] and Kd-tree [188] for data organization.


CG-Index [189] is a two-layer index constructed over the BATON network [190] 


and it employs B-tree to address one-dimensional high speed queries. 


R-Tree based index in CAN (RT-CAN) [191] is a multidimensional indexing
scheme in data centers. RT-CAN combines the CAN-based routing protocol
(Content Addressable Network) [192] and the R-tree-based indexing scheme [12]
to address efficient multidimensional query processing in a cloud system. RT-CAN
organizes storage and computation nodes in an overlay structure based on an
extended CAN protocol. The RT-CAN index is constructed on top of local R-trees
and published on the cluster servers. A mapping method assigns each selected
R-tree node to a CAN node. However, RT-CAN is constructed in a P2P network,
with nodes widely dispersed in the real world and unstable connections between
nodes, resulting in unreliable services [193]. RT-CAN is also not scalable with
respect to the dimensionality of the data: the original overlay network must be
expanded and additional servers need to be added to reconstruct the index, which
is costly. The query processing algorithms are designed to support point, range
and kNN queries for the RT-CAN index.


Local and Clustering Index (LC-index) has been proposed by Feng et al.
[194] as a combination of the RT-CAN index [191] and the CG-index [189], which
enhanced the flexibility and the insertion of multidimensional range queries. This
index is dynamic and supports insertion and deletion operations; however,
it has a high storage cost.


Hierarchical Irregular Compound Networks (RT-HCN index) Hong et
al. [193] proposed an indexing scheme that integrates R-tree [12] and a routing
protocol based on Hierarchical Irregular Compound Networks (HCN). This scheme,
called RT-HCN, is proposed to organize storage and compute nodes in an HCN
overlay, in a server-centric cloud storage system. The RT-HCN is composed of two
layers. In the first layer, the data is distributed across different servers and
locally indexed using an R-tree. In the second layer, the local indexes are
distributed across servers as a global index. Although the R-tree is a balanced
and dynamic tree, the search process degrades when the indexed data is large,
because the R-tree has multiple overlapping MBR regions.


RB-index [195] is an efficient and scalable multidimensional indexing scheme
for the BCube topology [196] in modular data centers. RB-Index is a two-layered
indexing scheme that integrates the BCube-based routing protocol and the
R-tree-based indexing structure. It uses the routing tables of a set of switches
in order to build an indexing space with n dimensions. This space is divided into
n subspaces. Each server creates its R-tree and then publishes its address and its
MBR in its routing table. In RB-Index, several distinct indexing spaces are built
with selected dimensions according to the query history. Every server takes over
part of the indexing space according to a mapping scheme. According to the
authors, the division of the space into several subspaces during the publication
of the R-tree nodes in the form (ip, MBR) produces false positives.


U²-tree Gao et al. [197] proposed a universal two-layer indexing scheme built
on cloud storage systems with tree-like DCN (Data Center Network) topologies,
called U²-tree. The first layer, named local index, facilitates query processing
on local hosts. The second layer, named global index, locates the host in which the
data is stored. The construction of the U²-tree starts with building the local
index, using the B+-tree [182] on local data for efficient query search. The global
index indicates in which local host the data is located. Each host manages a
portion of the global index within a certain range. The U²-tree supports point
queries, range queries and kNN queries; however, the cost of updating and
maintaining the distributed index is high.


Continuous Range Index (CR-index) Wang et al. [198] proposed a continuous
range index (CR-index) for indexing observed data based on its value ranges
and type attribute. CR-index builds a compact indexing scheme in which
measurement data items and observation data items are aggregated into boundary
blocks based on their interval blocks. The indexes are built to answer range queries.
However, this approach is only able to index data with a single dimension [199].
In addition, the single dimension of the CR-index makes it unsuitable for data of
higher dimensions [198].


Complemental Clustering Index (CC Index) [200] is an additional index
based on a key-value store. A secondary index table is built for each indexed
column. In order to improve random read performance, more detailed information
about each record is pushed into the secondary index table, so that random
reads can be turned into sequential reads. The authors also suggested some
optimization methods to support multidimensional queries. CC Index is simple
to implement; however, it suffers from various drawbacks. Firstly, it requires a
large amount of additional storage space when there are many indexed columns.
Secondly, CC Index does not support adding or deleting indexes
after the table has been built.


Update and Query Efficient index framework (UQE-Index) Ma et al. [201]
proposed an efficient update and query index framework (UQE-Index) based on a
key-value store that can support a high insertion rate and simultaneously provide
efficient multidimensional queries. The UQE-index divides data into two types,
historical data and current data, which are indexed with different granularities.
For the historical data, a finer combined index is applied: a spatial index is
built inside a temporal index. For the current data, in order to handle high
update pressure, a coarse-grained index is applied. They used data partitioning and
tree-based indexing to develop the HBase-based UQE-Index framework to make
data management more efficient. According to them, the response time under the
UQE-index is lower than that of the EMINC framework. However, this framework
supports only range queries. The kNN search query, which is more general than
the range query, was not tested. Moreover, although the UQE-index proposes a
complete index structure that deals with spatio-temporal attributes, there are
other attributes in the IoT domain [202].


SeaCloudDM [203]: the continuous data generated by IoT devices is
received, stored and processed in a sea-computing layer. The products of the
sea-computing layer are numeric key sample values that are considerably
smaller than the original data from the devices. This key sample data is passed
to the cloud data management layer for later processing. A combined Relational
Data-Base and Key-Value (RDB-KV) store cloud data management model is
employed to manage SQL queries and keyword search. However, this method manages
massive data from heterogeneous sensors in the cloud, which suffers from latency
problems.


Multi-attribute index [202]: in this approach, four types of attributes are
employed: spatial, temporal, keyword and value. A specific indexing method is
allocated to each attribute, and the inclusion of these four indexes in a combined
index needs a certain sequencing that determines the performance of the query
search. The query search performance is improved by taking into account all
possible sequences and by automatically determining the most efficient combined
index for each query. This approach focuses on enhancing the performance of query
search, and the authors do not specify the way to store indexed IoT data [202]. The
B+-tree is used for storing the temporal attribute and the value attribute, since
it is a balanced structure and can support range queries efficiently. The R-tree
is used for storing spatial attributes, which are usually used to describe the
geographic location of IoT data. However, the authors only consider numerical
data, which is one-dimensional.


S²R-tree [204] integrates spatial and semantic information. It adopts
two layers. The first is a spatial layer that uses an R-tree to group objects
according to their geographical coordinates. The second is a semantic layer that
transforms the high-dimensional semantic vectors to a low-dimensional space.


Distributed Access Pattern R-tree (DAPR-tree) [205] is designed for spatial
data retrieval in a distributed computing environment. The balance of the index
structure and the parallelization of the workload between several main computation
nodes allow rapid data retrieval. For this reason, the authors apply the R-tree
structure on a three-tier distributed environment. The principal tier is the input
of the global index and manages the data partitions for the sub-tier. The sub-tier
constructs a number of sub-tree indexes for the different data partitions; each
sub-tree adopts an R-tree, R*-tree or APR-tree. A data and computational tier
provides data and computational resources to take care of the operations of the
DAPR-tree. During the search for a given query, the master node sends the query
to all partitions at the same time. Each partition searches locally. Then all
partitions send their results to the master node. The DAPR-tree is an efficient
indexing approach for spatial data retrieval in a distributed environment; it
ensures the balancing of data distribution, workload and data access. However,
this tree is limited to applications that have relatively stable data access
patterns. It is not adapted to a dynamic IoT environment. In addition, the master
node can become overloaded when multiple queries arrive.


Block Grid Index (BGI) Yang et al. [13] proposed this index, which is a
distributed method for large-scale moving objects with two layers: a grid-based
index and DBGKNN (a distributed k-nearest neighbours query search algorithm
based on BGI). According to the authors, this work requires an optimization of the
index structure by the incremental update of the kNN query search when the objects
are moving. Moving objects represent the only type of data that the BGI method
can index.


In-memory based Two-level Index Solution in Spark (ITTIS) is another
framework for processing temporal data in a real-time distributed system based
on Apache Spark memory [14]. ITTIS consists of three levels. The first level is
the partition unit, which is responsible for partitioning all temporal data across
distributed nodes. Each partition consists of a set of intervals. Each interval
is defined by (start, end, value). The second level is the local index unit:
every partition is indexed by an MVB-tree (MultiVersion B-tree) [206]. The third
level is the global index unit, which is located in the master node. It is used
to collect the intervals of all partitions in the master node of Apache Spark,
and a BST (Binary Search Tree) is built over all these intervals. The search for a
query is done in two steps. The first step is to find the candidate partition: the
query is searched in the BST by pruning the sub-trees that are not
suitable. In the second step, the search begins in the candidate partition to find
the searched record. ITTIS provides native support for querying big data. However,
dividing the search into two stages can take a long time. In addition,
this framework only supports temporal data and a specific type of queries, which
cannot be replaced by other types of IoT data.
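

The two-step search can be sketched as below, under simplifying assumptions that are not part of ITTIS itself: the global index is a plain sorted list of partition time spans instead of a BST, the local search is a linear scan instead of an MVB-tree lookup, and the query is a single timestamp. The names `build_global_index` and `search` are hypothetical.

```python
def build_global_index(partitions):
    """Map each partition id to the time span covered by its intervals."""
    spans = []
    for pid, intervals in partitions.items():
        start = min(s for s, _, _ in intervals)
        end = max(e for _, e, _ in intervals)
        spans.append((start, end, pid))
    return sorted(spans)


def search(partitions, global_index, t):
    """Return the values whose interval [start, end] contains timestamp t."""
    # Step 1: prune partitions whose overall span cannot contain t.
    candidates = [pid for start, end, pid in global_index if start <= t <= end]
    # Step 2: search only inside the candidate partitions.
    results = []
    for pid in candidates:
        for start, end, value in partitions[pid]:
            if start <= t <= end:
                results.append(value)
    return results


# Each interval is (start, end, value), as in the description above.
partitions = {
    "p0": [(0, 4, "a"), (5, 9, "b")],
    "p1": [(10, 14, "c"), (12, 19, "d")],
}
global_index = build_global_index(partitions)
print(search(partitions, global_index, 13))
```

The second step only starts once the first has finished, which is the sequential dependency behind the latency concern raised above.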


Distributed and Parallel architecture with Indexing for structural
clustering using SCAN algorithm (DPISCAN) Kumar et al. [207] proposed
this approach, which applies thread-level parallelism on the Apache Spark
distributed architecture. A cache-based indexing technique creates indexes for the
neighborhood vertices using the CSS-tree [208] structure to take care of a
combination of different threshold values. This work focuses only on data indexing
and no query search method was proposed.


Geospatial data indexing In [209], geospatial data indexing was performed
in the cloud, where a parallel R-tree [12] and its parallel variants were constructed.
Three construction methods were used: Apache Spark in-memory, Apache Spark
on disk and MapReduce. Each one was evaluated to find the fastest way of building,
updating and executing spatial queries. One of these three methods, Apache
Spark in-memory, significantly reduces the time for indexing geospatial data and
for range queries. However, this method is only used for geospatial data where the
dimension is limited to three.


Three-level hierarchical index Hu et al. [210] presented a three-level
hierarchical indexing method to enhance Apache Spark and the Hadoop Distributed
File System (HDFS) for managing data from Earth observations and model
simulations. They combine the global kd-tree index for the master node with the 
local hash table for each data node, which provides a scalable indexing strategy for 
searching large raster geospatial data in a distributed environment. They devel- 
oped a data distribution strategy to address query parallelism while maintaining 
high data locality. This method only supports querying large geospatial data and 
is not tested for other data types. 


Indexing within lossless compression Data Doan et al. [211] proposed an
indexing model consisting of a lossless compression technique for IoT data that
draws on the benefits of bit-padding, bit-blocking, and Huffman coding. It minimizes
the data size during compression and does not require fixed 8-bit streams. The
index is based on timestamps, is linked in during the compression process, and
supports access to compressed data without full decompression. This framework
focuses on building indexing within lossless compression for floating point time
series data. According to the authors, this framework needs to be enhanced by
addressing temporal alignment and de-duplication problems when IoT streaming
data is sourced from multiple devices. However, the use of indexes based on
timestamps makes this method suited only to a specific type of data [211].


Textual and spatial objects indexing Bavirthi et al. [212] proposed an
indexing mechanism combining textual and spatial objects for skyline querying. An
inverted file is used to index textual objects and is attached to the R*-tree.
However, adapting skyline queries to a dynamic environment like IoT is a difficult
process, since tuples may be removed and inserted at any time or in specific time
intervals.


SSKQR* Recently, another framework was introduced in the literature to recover
the most relevant data [213]. It is a framework for spatio-textual skyline
querying with the R*-tree indexing technique, named SSKQR*, which relates the
keywords provided by the user to the geometric data of the user.
The skylines of data with geometric information are recovered by searching the
geometric data points closest to the user’s location. The recovered skylines are also
verified by specifying a threshold value ’k’, named top-k skyline querying. However,
the R*-tree is considerably more difficult to construct and maintain. In
addition, this framework has no consideration for the IoT environment.


Data lakes approach According to the Seattle Database Research report [214],
a new approach called data lakes is proposed to store and analyze a huge amount
of data. In this approach, data flows to a distributed storage system such as
HDFS, where it is analyzed and managed, instead of being uploaded to data
warehouses, which induces a high maintenance cost [215].


Haystack queries The approach proposed by Weintraub et al. [215] aims to
optimize needle-in-a-haystack queries in cloud data lakes. This approach consists
of the construction of an index structure that maps indexed column values to
their files. Parallelism is used in this approach to ensure scalability in
both the compute and storage senses. The cost of this index is significantly high
and the data load time is much longer.
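

The core idea, an index that maps indexed column values to the files containing them, can be sketched as follows. This is an illustrative in-memory version only: the file names, the `device` column, and the functions `build_value_index` and `files_to_scan` are hypothetical and do not reproduce the structure of [215].

```python
from collections import defaultdict


def build_value_index(files, column):
    """Map each value of `column` to the set of files that contain it."""
    index = defaultdict(set)
    for file_name, rows in files.items():
        for row in rows:
            index[row[column]].add(file_name)
    return index


def files_to_scan(index, value):
    """Files that may contain `value`; all other files are skipped."""
    return sorted(index.get(value, ()))


files = {
    "part-0.csv": [{"device": "d1", "temp": 20}, {"device": "d2", "temp": 21}],
    "part-1.csv": [{"device": "d3", "temp": 19}],
    "part-2.csv": [{"device": "d1", "temp": 25}],
}
index = build_value_index(files, "device")
print(files_to_scan(index, "d1"))
```

Building the index requires reading every row once at load time, which is consistent with the longer data load time noted above.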


Hierarchical multidimensional indexing A hierarchical multidimensional in- 
dexing method based on binary space partitioning (BSP) was proposed by Wan 
et al. [216] for efficient spatial query processing. After evaluating k-d-tree, quad- 
tree, k-means clustering and Voronoi diagram data structures, they found that the 
Voronoi diagram data indexing method is suitable for general query operations 
with a response time of O(log(n)). However, the dimension limitation and the 


specific type of query make this method difficult to generalize. 


4.3. Metric Space Indexing Methods 


The characteristics of IoT data, namely the diversity of type, format and
dimension, require us to consider a metric space. Searching in a metric space
offers several benefits. The most important is that a larger number of data types
can be indexed, as this approach is based only on the calculation of distances
between objects and not on their content [15].
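

As an illustration of this point, the sketch below runs the same kNN routine on two unrelated data types by swapping only the distance function. It is a naive linear scan, not one of the indexes discussed in this chapter, and the function names `knn`, `euclidean` and `hamming` are chosen here for the example.

```python
import heapq
import math


def knn(objects, query, k, distance):
    """Linear-scan kNN: only `distance` ever inspects the objects."""
    return heapq.nsmallest(k, objects, key=lambda o: distance(query, o))


def euclidean(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.dist(a, b)


def hamming(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 0.5)]
words = ["karol", "kathy", "carol", "kerol"]

print(knn(points, (1.0, 0.0), 2, euclidean))
print(knn(words, "karol", 2, hamming))
```

The search code never looks inside a point or a word; any data type can be handled as long as a distance satisfying the metric axioms is supplied.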


4.3.1 Centralized metric space indexing methods 


Centralized indexing techniques in metric spaces are classified into two main
approaches. The first approach partitions the space and is divided into two
concepts: hyperplane partitioning and ball partitioning. The second approach
partitions the data (Figure 4.5). In the following, some centralized techniques,
classified according to the two above-mentioned approaches, will be presented.


Figure 4.5: Centralized metric space indexing methods.


4.3.1.1 Space partitioning 


Indexing techniques, based on space partitioning, are grouped into two indexing 
methods based on the partitioning concept: indexing methods based on the hy- 
perplane partitioning and indexing methods based on ball partitioning. Other 
indexing methods, proposed to reduce the computation of distances in a metric 
space, will be also presented. In these methods, the metric space is partitioned 


into vectors by mapping functions. 


Hyperplane partitioning 


• BiSector tree (BS-tree) [217] is a recursive binary tree constructed based on
the generalized hyperplane partitioning. The coverage radius of each pivot is
determined and stored in the corresponding node. The radius of coverage
represents the maximum distance between the pivot and all objects in its
subtree. The only type of search in this index is the range query.


• Monotonous BiSector tree (MBS-tree) is a modification of the BS-tree
proposed by Noltemeier et al. [218] in order to minimize the cost of computing
distances when processing a range query. The pivots in the nodes of the
tree are minimized so that the pivots corresponding to the left subtree and
the right subtree are copied into the corresponding inner child nodes,
respectively.


• Voronoi tree (V-tree) [219] is a ternary tree with each node representing at
least two and at most three points of O. The root node is allowed to represent
only one element. The V-tree is an unbalanced structure. If a new leaf v has
to be created in order to store a new point P, and Q is P’s nearest neighbor
among the (three) points stored in the father node of v, then Q (redundantly)
has to be stored in v too. The insertion of new objects into new leaves of
a V-tree (see the example in Figure 4.6) induces a new
space partitioning in the Voronoi diagram.


Figure 4.6: (a) Insertion of new objects in the V-tree. (b) Corresponding space
partitioning in Voronoi diagram [219].


• Generalised Hyper-plane tree (GH-tree) [220] is similar to the BS-tree since both
partition the dataset recursively via the generalized hyperplane principle.
The distinction is that the GH-tree uses the hyperplane between the pivots
p1 and p2 to determine the subtrees and does not use
covering radii as a pruning factor in the search process. The two points p1
and p2, chosen randomly, partition the space into two regions. The other
objects are associated with their closest pivot, p1 or p2, and thus a generalized
hyperplane that separates the dataset into two subsets is created (Figure
4.7). The space complexity of GH-tree and BS-tree is O(n) and O(n log(n))
respectively and distance calculations are necessary to build the tree. The
disadvantages of the above structure lie in the search operation, where for
each node two distance computations are applied, which results in a higher
search cost. Moreover, the chosen pivots do not guarantee an optimal
partition of the space, which can cause the index to degenerate.


Figure 4.7: (a) Hyperplane space partitioning (b) Structure of the GH-tree [221].


• Geometric Near-neighbor Access Tree (GNAT) is a static m-ary tree
[222]. It is a generalization of the GH-tree. The difference is that GNAT
uses n pivots in each internal node instead of two pivots (Figure 4.8). The
data set O is divided recursively into n subspaces S = {S1, S2, ..., Sn} by the
set of pivots P = {p1, p2, ..., pn}. The rest of the objects are assigned to the
subspaces according to the closest distance to a pivot pi ∈ P. This index
increases memory requirements and computational costs due to selecting new
sets of pivots repeatedly.


Figure 4.8: Example of partitioning used in GNAT-tree (a) and the corresponding
tree in (b) [15].


• Evolutionary Geometric Near-neighbor Access Tree (EGNAT) [223] is an
improvement of GNAT. It is dynamic and allows node organization and
update after initial bulk loading, using placeholders for deleted (or changed)
objects. However, it does not guarantee that the inner nodes do not
overlap.


• Balanced Metric space (BM-index) [224] is proposed as a solution for the
unbalanced partitions in the Voronoi diagram. It is based on a pivot permutation
scheme in the weighted Voronoi partitioning to eliminate under- and
over-filled buckets (Figure 4.9). However, the computational cost is very high,
owing to the convergence time of the construction algorithm and
the complexity of computing the pivot weights.


Figure 4.9: Voronoi cells for pivots p1, p2, p3: (a) 1-level tessellation, (b) pivot
permutations [224].


• Voronoi Diagram tree (VD-tree) [225] is a dynamic Metric Access Method
(MAM). It combines the coverage radius strategy and the flexible node
partitioning heuristic of the Slim-Tree [226] with the rigid space partitioning of
Voronoi diagrams. The VD-tree reduces overlap between nodes by dynamically
exchanging overlapped elements, with the aim of eliminating it.
It is based on the displacement of the element furthest from the
representative, which reduces the calculation cost. However, the
displacement of these elements leads to unbalanced partitions.


• Complete Hyper-plane tree (CGH-tree) [227] is the combination of the GH-tree
and the mVP-tree [228] using two pivots. It partitions the space recursively with
two hyperbolas and two ellipses. Balance is not guaranteed.
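

The generalized hyperplane principle shared by the methods above can be sketched in the style of a GH-tree. This is a minimal illustration under stated assumptions, not the structure of [220]: nodes are plain dictionaries, pivots are drawn with a seeded random generator, and `build_gh_tree` and `range_search` are hypothetical names. The pruning test follows from the triangle inequality.

```python
import math
import random


def build_gh_tree(objects, distance, leaf_size=2, rng=None):
    """Recursively split `objects` by the generalized hyperplane of two pivots.

    `leaf_size` is assumed to be at least 1.
    """
    if rng is None:
        rng = random.Random(0)
    if len(objects) <= leaf_size:
        return {"leaf": list(objects)}
    p1, p2 = rng.sample(objects, 2)  # two pivots chosen randomly
    left, right = [], []
    for o in objects:
        if o is p1 or o is p2:
            continue
        # Each remaining object goes to the side of its closest pivot.
        (left if distance(o, p1) <= distance(o, p2) else right).append(o)
    return {
        "p1": p1,
        "p2": p2,
        "left": build_gh_tree(left, distance, leaf_size, rng),
        "right": build_gh_tree(right, distance, leaf_size, rng),
    }


def range_search(node, query, radius, distance, out=None):
    """Collect every object within `radius` of `query`."""
    out = [] if out is None else out
    if "leaf" in node:
        out.extend(o for o in node["leaf"] if distance(query, o) <= radius)
        return out
    d1 = distance(query, node["p1"])  # two distance computations per node
    d2 = distance(query, node["p2"])
    if d1 <= radius:
        out.append(node["p1"])
    if d2 <= radius:
        out.append(node["p2"])
    if d1 - radius <= d2 + radius:  # query ball may reach the p1 side
        range_search(node["left"], query, radius, distance, out)
    if d2 - radius <= d1 + radius:  # query ball may reach the p2 side
        range_search(node["right"], query, radius, distance, out)
    return out


data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(200)]
tree = build_gh_tree(points, math.dist)
found = range_search(tree, (0.5, 0.5), 0.2, math.dist)
brute = [p for p in points if math.dist((0.5, 0.5), p) <= 0.2]
print(sorted(found) == sorted(brute))
```

No covering radius is stored, so pruning relies only on the two pivot distances; with poorly placed pivots both branches are visited at most nodes, which matches the degeneration noted for the GH-tree.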


Ball partitioning The benefit of ball partitioning is the fact that it only needs
one pivot p and the resulting subsets contain the same amount of data, assuming
that the median distance dm is utilized. Several indexes based on ball partitioning
have been proposed, such as:


• Vantage Point tree (VP-tree) is a binary tree built on the partitioning of the
space by balls as a function of the distance [229]. In the VP-tree, the pivot p
is selected randomly (Figure 4.10). The median dm of the distances of the set
of objects to the pivot is calculated. Then the median dm is used to define
a ball B(p, dm) that divides the space into two disjoint regions. The VP-tree
is very costly in terms of distance computations and time, particularly in
high-dimensional data spaces, in which the number of branches retrieved
is great [230].


• Multiple Vantage Points tree (mVP-tree) [231] is a generalization of the
VP-tree [229]. It represents an m-ary version of the VP-tree. Each node is
partitioned into several "segments" of equal cardinality by concentric rings
around the same center, instead of a single ball.
In fact, it is based on the quantiles rather than the median. The function
of this type of index is therefore very similar to that of the VP-tree. The
construction time is in O(n log n). The mVP-tree improves the VP-tree [228] and
a greater improvement of the mVP-tree is obtained by employing many pivots
per node [17].


Figure 4.10: Description of the VP-tree.


• Memory based Metric tree (MM-tree) [232] divides the metric
space successively into four regions using two balls constructed around
two random pivots p1 and p2 (Figure 4.11). Region I is the intersection of
the balls. Regions II and III are the differences of each ball from the other.
Region IV is the rest of the space. The distance between p1 and p2 is the radius
of the two balls. However, the partitioning of the MM-tree may generate
subspaces of very different sizes, which produces strongly
unbalanced structures [233]. In addition, this index does not support high
dimensions.


Figure 4.11: Example of a MM-tree indexing of 8 objects [232].


• Onion-tree is the extended version of the MM-tree [233]. It recursively
partitions space into non-overlapping regions using hyper-spheres to define
disjoint subspaces (Figure 4.12). It introduces three features: a partitioning
method that controls the number of disjoint subspaces generated at each
node, a replacement technique that can change the pivots of leaf nodes in
insertion operations, and extended kNN query algorithms that support the new
partitioning method and include a new visiting order of subspaces. The
increase in the number of partitions in the space provides fast indexing
of complex data and accelerates the search for similarity queries.
However, the issue with the above structure is the extended construction time
due to the reinsertion of objects.


Figure 4.12: Example of two expansion procedures applied to a node N [233].


• Intersection Metric tree (IM-tree) [221] recursively divides the dataset into
five disjoint regions by selecting the two farthest points as pivots p1 and p2.
The fourth region is partitioned into two regions using a plane. Figure 4.13
represents the IM-tree building at a given stage of the recursive splitting
process of the dataset. The regions I, II, III, IV and V collapse to level 2.


The IM-tree structure is composed of:
— Leaf nodes: each consists of a subset of the indexed objects.
— Internal nodes:
* N1 for the intersection.
* N2 for the partial ball centred on p1.
* N3 for the second partial ball centred on p2.
* N4 for the remaining space close to p1.
* N5 for the remaining space close to p2.


Although it improves on the MM-tree and onion-tree structures, the
external region of the balls in the IM-tree causes the degeneracy of the
index for massive data.


Figure 4.13: Description of the IM-tree [221]. 


• eXtended Metric tree (XM-tree) [234] divides the space with spheres. It
uses two structures, a sequential structure and a tree structure, to reduce the
volume of the outer regions of the spheres: it creates extended regions, inspired
by the X-tree [177], inserts them into linked lists, and
excludes empty sets that do not include any objects. Figure 4.14 represents
the XM-tree building at a given stage of the recursive splitting process of
the dataset. The regions I, II and III collapse to level 2, and the nodes eXt1 and
eXt2 collapse to the same level. The elements are distributed according to the
partitions to which they belong. XM-tree nodes have the following structure:
leaf nodes (objects), internal nodes (Normal Directory) and extended nodes
(chained lists of objects). The advantage of this structure is that it is simple
to identify to which partition an element belongs, while ensuring no overlap
between nodes of the same level of the tree. In addition, the extended regions
help to speed up the kNN search due to the exclusion of some objects that
are not needed to compute the relative distances of a query object.


Figure 4.14: Description of the XM-tree [234].


• Non-Overlapping Balls and Hyper-planes tree (NOBH-tree) [235] is based on the recursive division of the space into six regions by using two pivots (p1, p2) ∈ O (Figure 4.15). The rest of the objects are separated so that evaluating the distances of an element si to p1 and p2 assigns it to exactly one region Si. This excludes overlapping regions when answering a point query. The distance between p1 and p2 is called the node hop, and the regions are divided using both a metric hyperplane and two ball regions, where the radius r of the balls is the node hop. This method suffers from the high cost of insertion and search.




CHAPTER. 4 Big IoT Data Indexing 


Figure 4.15: Six regions that can be combined to create NOBH-tree members [235]. 


Mapping pivots partitioning 


• D-index is a multi-level metric structure that uses ρ-split functions, one for each level, to create buckets for storing objects [236]. Here, the ρ-split functions of the individual levels use the same ρ. In Figure 4.16, a ρ-split function based on O7 is used at level 1, and a ρ-split function based on O3 is used at level 2. Objects in the exclusion bucket '-' (i.e., O3, Os, Og) at level 1 are candidates to be divided at level 2, and the exclusion bucket of the last level forms the exclusion bucket of the D-index [237].

• eD-index is an extension of the D-index with a modified split function [238].


• iDistance is a B+-tree based indexing method for similarity search in vector spaces [239]. The iDistance partitions the dataset into n clusters and establishes a reference point p_i for each cluster C_i, i ∈ {0, ..., n−1}. Every object o ∈ O is then assigned a numeric key according to its distance from the reference point of its cluster. The iDistance key for an object is:

$$\mathrm{iDistance}(o) = d(p_i, o) + i \cdot C \qquad (4.1)$$

Figure 4.16: Example of the D-index [237].

These reference points are used to transform each partition into a one-dimensional space. The formula maps all objects of any cluster C_i to the interval [i·C, (i+1)·C], where C is a constant that separates the key ranges of the clusters (Figure 4.17-a). Mapped objects are indexed by a B+-tree and the search is performed by one-dimensional range queries. However, for a range query R(q, r), several intervals of iDistance keys must be determined and accessed in order to process the query (Figure 4.17-b).

Figure 4.17: Principles of the iDistance [240].


• Metric index (M-index) is an extension of the iDistance [240]. It partitions data using a Voronoi diagram in several levels (Figure 4.18). In the M-index, the clusters of iDistance [239] are replaced by Voronoi cells. The Voronoi cell centers and the objects belonging to each cell are mapped by the iDistance method. Mapping the elements of a metric space into a numerical domain allows the execution of precise and approximate search algorithms using interval queries.

Figure 4.18: Dynamic cluster tree with 3 levels [240].


• Space-filling curve and Pivot-based B+-tree (SPB-tree) [241]: in this structure, the SFC (Space Filling Curve) function is used to partition the space into compact regions in the form H and to transform the space into a one-dimensional space. The objects mapped by this function are indexed by the B+-tree (Figure 4.19). The SPBs are components of the B+-tree with MBBs (minimum bounding boxes), and the objects are stored in a RAF access page. The RAF page stores the objects in ascending order of their SFC values, and the B+-tree is built and manipulated with minimal cost and minimal storage by regrouping data in compact regions. In addition, its features allow efficient algorithms for handling similarity search and similarity joins. However, parallelism is difficult to exploit during construction steps such as space transformation and preprocessing [152].

Figure 4.19: Construction framework of an SPB-tree [227].


In the D-index [236], eD-index [238], iDistance [239], M-index [240] and SPB-tree [241], mapping the data with respect to the pivots forces coordinates onto the data of the metric space. However, this mapping generally distorts distances: the distance between two points in the metric space is generally not equal to the distance between their images in the mapped space.
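As a rough illustration of the iDistance mapping of Equation 4.1, the following Python sketch assigns each object to its nearest reference point and answers a range query by scanning one key interval per cluster. It assumes Euclidean points, a sorted list standing in for the B+-tree, and a constant C larger than any object-to-reference distance; the function names are illustrative and are not taken from [239].

```python
import math
from bisect import bisect_left, bisect_right


def idistance_key(obj, refs, C, dist=math.dist):
    """Return (i, key) where p_i is the nearest reference point and key = d(p_i, obj) + i*C."""
    i = min(range(len(refs)), key=lambda j: dist(refs[j], obj))
    return i, dist(refs[i], obj) + i * C


def build_index(objects, refs, C):
    """Sorted (key, object) pairs; a stand-in for the B+-tree (assumption)."""
    return sorted((idistance_key(o, refs, C)[1], o) for o in objects)


def range_query(index, refs, C, q, r, dist=math.dist):
    """Scan, for each cluster i, the key interval that can contain answers to R(q, r)."""
    keys = [k for k, _ in index]
    hits = set()
    for i, p in enumerate(refs):
        dq = dist(p, q)
        # Triangle inequality: an answer o in cluster i has |d(p_i, o) - d(p_i, q)| <= r.
        lo = i * C + max(dq - r, 0.0)
        hi = i * C + min(dq + r, C)
        for pos in range(bisect_left(keys, lo), bisect_right(keys, hi)):
            # The one-dimensional key only filters; the real distance decides.
            if dist(index[pos][1], q) <= r:
                hits.add(pos)
    return [index[pos][1] for pos in sorted(hits)]
```

The final distance check is what compensates for the distortion discussed above: the key interval only selects candidates, and every candidate is still verified in the original space.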


4.3.1.2 Data partitioning 


Data partitioning methods divide the set of points according to their distances to the selected pivots. Among these techniques we cite the following:


• M-tree is a dynamic and balanced metric tree that supports consecutive insertions [242]. Its leaf nodes store all the elements, while its internal nodes store selected elements called representatives. Each representative has a covering radius, so that the data is partitioned into balls, each defined by a pivot and a radius (Figure 4.20). This method is dynamic and balanced; however, its performance degrades because of the overlap between nodes, which increases the possibility of multi-way traversal. It is not scalable for high volumes of data [243].

Figure 4.20: Descriptive scheme of the M-tree [18].


• Slim-tree is an improvement of the M-tree [242] that reduces the degree of overlap between nodes [226]. It introduces a new splitting technique based on the Minimum Spanning Tree (MST). The main drawback of this structure is the possibility of creating nodes that contain empty nodes, which strongly limits the performance of the index, mainly in high-dimensional spaces [244].


M-tree [242] and Slim-tree [226] are height-balanced structures that achieve very good performance in terms of both disk accesses and run time, mainly because the trees are very short. However, the performance of these two structures degrades very easily, because the radius of the nodes and the overlap between nodes increase to the point that a large number of subtrees must be analyzed when processing a query [245].


• Density Based Metric tree (DBM-tree) is an extension of the Slim-tree [245]. It was the first dynamic MAM to control overlap: it minimizes the overlap among high-density nodes by relaxing the height-balancing rule (Figure 4.21). Subtrees are made deeper in denser regions of the metric space and shallower in sparser regions. Reducing the overlap of nodes was found to reduce the number of node accesses, improving performance. It is a balanced structure. It decreases the overlap between balls and the number of distance calculations when searching a query. Despite these advantages, the reorganization of the data for balancing the tree adds computation time.


• DBM*-tree is an improvement of the DBM-tree that aims to avoid recalculating distances when choosing the appropriate subtree during the construction process [246]. The authors proposed a matrix of distances between all objects in each node. However, this solution has the disadvantage that the storage space required for the distance matrix at each node becomes prohibitive for big data.

Figure 4.21: Description of the DBM-tree [246].


• M*-tree [247] is an M-tree extension with the creation of a super node, which is inspired by the X-tree (Figure 4.22). The M*-tree provides a large search area by extending the super node to metric spaces. A new node division method is introduced in the M*-tree to keep the cost of index construction low. In addition, an inner index is proposed in the M*-tree to transparently manage the CPU costs in the extended leaf nodes due to the introduction of the super node.

Figure 4.22: Structure of the M*-tree [247].

• PM-tree: in this index, for every leaf overflow, a split algorithm is used to create a new node and to distribute the elements between the two nodes [248]. Each node promotes one element to the upper level, which stores it together with the coverage radius. The upper levels may be updated recursively, if necessary. This process guarantees that the structure is always balanced. However, when an inner node is split, selecting an element to be promoted and removing it from the node is not possible, as each element is a pivot that represents a branch. The algorithm employs the aggregate nearest neighbor query to solve this problem. This query minimizes the sum of distances to the set of ball pivots, among other aggregation functions. This strategy builds compact indexes that increase the performance of k-nearest neighbor queries, thanks to the faster convergence of the query algorithms.


• Super M-Tree is an extension of the M-Tree [242] in which approximate subsequence and subset queries become nearest neighbor queries [249]. The authors introduced metric subset spaces as a generalization of metric spaces. These spaces use a different distance function to calculate the distance between the objects in internal nodes and their subtrees.


• Hollow-tree is a strategy capable of handling missing data, that is, values that have not been observed or recorded [250]. It is mainly based on two techniques: CFMLI (Complete First and Missing Last Insert) and ObAD (Observed Attribute Distance). The CFMLI technique is used to index the observed data, placing the elements with missing values (NULLs) at the leaf nodes, while the ObAD technique is applied to compute the set of distance functions, based on the possible combinations of observed and missing values, to estimate the similarity score according to the attributes observed in the two elements. However, the existence of NULL values could affect the indexing in general, because the leaves are full but contain NULL data.
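The ball partitioning used by the M-tree family above lends itself to a simple pruning rule: by the triangle inequality, a subtree whose ball lies entirely outside the query ball cannot contain an answer. The following Python sketch illustrates that rule for a range query. The dictionary-based node layout (keys `pivot`, `radius`, `objects`, `children`) is a hypothetical simplification chosen for the example, and the code does not reproduce the insertion or split algorithms of [242].

```python
import math


def can_prune(d_q_pivot, covering_radius, query_radius):
    """True when the ball (pivot, covering_radius) cannot intersect the query ball."""
    return d_q_pivot - covering_radius > query_radius


def range_search(node, q, r, dist=math.dist, out=None):
    """Collect every object within distance r of q, skipping pruned subtrees."""
    out = [] if out is None else out
    if can_prune(dist(q, node["pivot"]), node["radius"], r):
        return out  # the whole subtree is skipped without visiting its objects
    for obj in node.get("objects", []):
        if dist(q, obj) <= r:
            out.append(obj)
    for child in node.get("children", []):
        range_search(child, q, r, dist, out)
    return out
```

When the balls of sibling nodes overlap, fewer subtrees satisfy `can_prune`, which is the degradation reported for the M-tree and the Slim-tree.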


Pre-computed distances methods Other methods use a distance matrix 
for storing pre-computed distances from each database object to a set of pivots. 


Among these methods we cite the following: 


• Approximating and Eliminating Search Algorithm (AESA) is generally regarded in the literature as the most efficient MAM [251] [252]. It is based on an n × n matrix of distances between the n objects, which means that it requires O(n²) distance computations; this is why the method is not practical for large datasets. According to its authors [251], AESA is only suitable for small datasets of at most a few thousand objects.


• Linear AESA (LAESA) was proposed in order to minimize the construction cost of AESA [253]. It needs only O(kn) distance computations for construction, where k is the number of pivots. However, it loses some search efficiency compared to the original AESA.

AESA and LAESA are static methods. They build an index structure based on fixed datasets, and there is no way to insert or remove objects from the structure [254].


• Extreme Pivoting (EP): this index is based on the selection of a set of essential pivots (without redundancy) covering the entire database [255].


• Improvable LAESA (I-LAESA) is an improvement of LAESA [256]. It reduces the number of distance calculations when searching a query. LAESA uses only the exact distances from k pivots to the other objects, whereas I-LAESA additionally uses an estimated distance. The estimated distance is inexpensive to calculate, and it can be updated during the search so that it gradually approaches the exact distance.


All these methods use a distance matrix to discard objects and avoid some distance calculations during the search. Nevertheless, these methods require more space to store the pre-computed distances, and their I/O costs are often high because the data is not clustered [252] [257]. In addition, these methods are not adequate for big IoT data.
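To make the role of the pre-computed distances concrete, the following Python sketch shows a LAESA-style filter: a table of k × n object-to-pivot distances is built once, and at query time an object is discarded whenever the lower bound max_j |d(q, p_j) − d(o, p_j)| exceeds the query radius. It is a minimal sketch assuming Euclidean points and a fixed pivot list; pivot selection and the approximation ordering of the original algorithms [251] [253] are not reproduced.

```python
import math


def build_pivot_table(objects, pivots, dist=math.dist):
    """Pre-compute the k*n distances from every object to every pivot."""
    return [[dist(o, p) for p in pivots] for o in objects]


def laesa_range(objects, pivots, table, q, r, dist=math.dist):
    """Return (answers, number of distance computations made at query time)."""
    dq = [dist(q, p) for p in pivots]
    computed = len(pivots)
    result = []
    for o, row in zip(objects, table):
        # Lower bound on d(q, o) from the triangle inequality, using stored distances only.
        lower = max(abs(a - b) for a, b in zip(dq, row))
        if lower > r:
            continue  # eliminated without computing d(q, o)
        computed += 1
        if dist(q, o) <= r:
            result.append(o)
    return result, computed
```

The table costs O(kn) space, which is the storage overhead mentioned above; with k = n it becomes the full AESA matrix.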


4.3.2 Distributed metric space indexing methods 
The distributed methods, cited in what follows, are grouped in figure 4.23. 


[Figure 4.23 diagram: metric space indexing methods, tree methods, divided into centralized and distributed methods. Distributed methods shown: M-CAN ([239], 2006), MESSIF ([228], 2007), M-Chord ([259], 2008), GHT* ([258], 2008), VPT* ([258], 2008), GHB ([262], 2018), DM-index ([6], 2018), ADMS ([263], 2019), BCCF ([5], 2020), Distributed M-tree ([264], 2020).]


Figure 4.23: Distributed metric space indexing methods. 


GHT* The authors of [258] presented a parallel version of the GH-tree. The aim is to distribute data storage over several servers. The AST (Address Search Tree) is a binary search tree based on the GH-tree [220]. It is generated in each server and client, and it is in charge of storing the data and routing the queries. The leaves of this tree point either to a bucket, through a BID (Bucket IDentifier), or to another server, through an NNID (Network Node IDentifier). This solution is efficient for data distribution. However, it is used for range queries only, so it does not cover the kNN search.


VPT* is a VP-tree distributed in a P2P network [258]. The AST (Address Search Tree) is a binary search tree based on the VP-tree [229]. This tree is in charge of storing the data and routing the queries. It is similar to GHT* [243] in the structure of the inner nodes and the leaves, except that with the VP-tree structure only half as many distances are stored as with the GH-tree, since each inner node contains only one pivot.


M-Chord This index uses the M-tree to index local peer data [259]. It maps the data space into a one-dimensional domain and navigates this domain using the Chord routing protocol [260]. M-Chord exploits the vector index method iDistance [239], which divides the data space into clusters C_i, finds the reference points p_i in the clusters and defines the one-dimensional mapping of data objects based on their distances from the cluster reference point. When processing a range query, the space to be searched is specified by the iDistance intervals of the clusters that intersect the query sphere.


M-CAN This method combines CAN and iDistance [239] for similarity search in metric space. A set of pivots P = {p_1, p_2, ..., p_N} is used to map objects o ∈ O to an N-dimensional vector space R^N [261]. The mapping function F : O → R^N, applied on the set of objects O, is defined as:

$$F(o) = (d(o, p_1), d(o, p_2), \ldots, d(o, p_N)) \qquad (4.2)$$

Pivot-based filtering is used to reduce the number of evaluated distances. For routing, each peer manages a coordinate-based routing table containing the network identifiers and coordinates of its neighboring peers in the virtual R^N space. In a range query search, the routing algorithm forwards the query to the neighbor whose region is closest to the target point in the vector space.
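The pivot mapping of Equation 4.2 can be sketched in a few lines of Python. The example below is only illustrative: it assumes a small set of strings compared with the edit distance (a metric with no natural coordinates) and two arbitrarily chosen pivots, and it checks the property that makes pivot-based filtering safe, namely that the L∞ distance between two mapped vectors never exceeds the original distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (a metric on strings)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def pivot_map(o, pivots, dist):
    """F(o) = (d(o, p_1), ..., d(o, p_N)): one coordinate per pivot."""
    return tuple(dist(o, p) for p in pivots)


def linf(u, v):
    """L-infinity distance in the mapped space; a lower bound on the metric distance."""
    return max(abs(a - b) for a, b in zip(u, v))


pivots = ["sensor", "cloud"]  # hypothetical pivots chosen for the example
words = ["sensor", "censor", "cloud", "clout", "fog"]
mapped = {w: pivot_map(w, pivots, edit_distance) for w in words}
```

Because `linf(mapped[a], mapped[b])` is only a lower bound on `edit_distance(a, b)`, an object can be discarded when the bound exceeds the query radius, but objects that pass the filter still need the real distance.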


The approaches cited above, GHT*, VPT* and M-CAN, do not mention algorithms for kNN queries. M-Chord and M-CAN are efficient for data distribution. However, GHT*, VPT* and M-CAN are used for range queries only, so they do not cover the kNN search.


GHB-tree is inspired by the GH-tree [262]. The main idea is to limit the volume of the space, so that some objects can be eliminated without computing their distances to the query object. The authors proposed a parallel search algorithm on a set of real machines in a P2P network [262]. This tree uses two pivots p1 and p2 to split the space into left and right regions using a hyperplane (Figure 4.24). The leaf nodes, in the left and right subtrees, contain a subset of the indexed data with a maximum cardinality of Cmax. This index has proven its efficiency for kNN search when compared with the onion-tree and the Slim-tree [262].

Figure 4.24: Parallel version of GHB-tree [262].
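The two-pivot hyperplane split that the GH-tree family relies on can be sketched as follows. This Python example assumes Euclidean points and pivots given by the caller, and it shows only one level of partitioning with the standard generalized-hyperplane pruning test; the volume limitation, the Cmax leaves and the parallel search of [262] are not reproduced.

```python
import math


def gh_split(objects, p1, p2, dist=math.dist):
    """Assign each object to the side of the pivot it is closer to (ties go left)."""
    left, right = [], []
    for o in objects:
        (left if dist(o, p1) <= dist(o, p2) else right).append(o)
    return left, right


def can_skip_side(q, r, own_pivot, other_pivot, dist=math.dist):
    """A side can be skipped when q is more than r beyond the hyperplane, on the other side."""
    return (dist(q, own_pivot) - dist(q, other_pivot)) / 2 > r


def range_query(left, right, p1, p2, q, r, dist=math.dist):
    """Range query over one split: visit only the sides that may hold answers."""
    out = []
    for side, own, other in ((left, p1, p2), (right, p2, p1)):
        if can_skip_side(q, r, own, other, dist):
            continue  # every object on this side is farther than r from q
        out.extend(o for o in side if dist(q, o) <= r)
    return out
```

Since the two sides are disjoint by construction, a query far from the hyperplane visits a single side, which is where the savings in distance computations come from.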


Asynchronous Metric Distributed System (AMDS) is a three-level distributed architecture for processing similarity queries in large-scale metric spaces [263]. In the first level (Figure 4.25), the root peer maps the data to a vector space and requests a mission to distribute the data among the master peers. In the second level, the master peers receive the data distribution mission and communicate with each other to divide the data (using the requester/editor principle). For the distribution of data, they use the Minimum Bounding Boxes (MBBs) of the R-tree. After the data is divided, each master peer is assigned its dataset, which is divided into equal parts and distributed to its worker peers. In the third level, each worker peer builds its index using the M-tree.

In AMDS, the objects are recursively divided into disjoint equal-sized partitions by the master peers, which continue to divide their own object fragments into equal-sized fragments and distribute them to their child peers. The M-tree is used for indexing the objects distributed to the worker peers in the mapped vector space. In addition, the authors introduced the publish/subscribe communication model to exchange messages asynchronously and reduce the time wasted in network interactions.

Figure 4.25: Structure of AMDS architecture [263].

Distributed M-tree This method uses the M-tree to solve similarity queries on complex data in multimedia databases only [264]. It is distributed on the Apache Spark framework. The similarity search uses the kNN algorithm, and only the first k response vectors are retained and sent to the master; the rest of the response vectors are ignored. This drawback reduces the efficiency of the kNN search algorithm.


Deployment Model (DM-index) This index was proposed for maintenance and recovery in the fog [6]. It was developed to eliminate redundancy, narrow the search space and reduce the number of traversed services and the recovery time. However, this index is used only for industrial IoT data and was not tested on other types of IoT data.


Binary tree based on containers at the cloud-fog computing level (BCCF-tree) In this model [5], the indexing of IoT data is performed at the fog layer because of its processing power, latency reduction and node distribution. This model can be adapted to emerging IoT technologies to improve the quality of indexing IoT data in real time. In the BCCF-tree, the space is recursively partitioned into two subspaces, centred on two pivots p1 and p2 determined using the k-means algorithm with k=2 [104], to achieve a balanced partitioning with minimum overlap and thereby reduce the computational cost and the complexity of the similarity search process (Figure 4.26). Despite its efficiency, the BCCF-tree has some drawbacks. The k-means algorithm, which is used during the BCCF-tree construction to decrease overlap, increases the computational cost and the complexity. In the BCCF-tree, as in all metric space methods, indexes degenerate due to the continuous growth of the collected IoT data.

Figure 4.26: Partitioning of space with BCCF-tree [5].
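The recursive two-way partitioning with k-means (k = 2) described above can be sketched as follows. This Python example is a simplified illustration, not the BCCF-tree of [5]: it assumes the data are coordinate tuples so that centroids exist, it seeds the two centres with the first point and the point farthest from it, and the `capacity` parameter and dictionary node layout are hypothetical. The repeated reassignment loop also makes visible the extra construction cost attributed to k-means.

```python
import math


def centroid(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(group) for c in zip(*group))


def two_means(points, iters=20):
    """Lloyd iterations with k = 2; returns ((c1, group1), (c2, group2))."""
    c1 = points[0]
    c2 = max(points, key=lambda p: math.dist(p, c1))  # simple seeding (assumption)
    g1, g2 = list(points), []
    for _ in range(iters):
        g1 = [p for p in points if math.dist(p, c1) <= math.dist(p, c2)]
        g2 = [p for p in points if math.dist(p, c1) > math.dist(p, c2)]
        if not g1 or not g2:
            break
        n1, n2 = centroid(g1), centroid(g2)
        if (n1, n2) == (c1, c2):
            break  # centres are stable
        c1, c2 = n1, n2
    return (c1, g1), (c2, g2)


def build(points, capacity=2):
    """Recursively split until a node holds at most `capacity` points."""
    if len(points) <= capacity:
        return {"leaf": points}
    (c1, g1), (c2, g2) = two_means(points)
    if not g1 or not g2:  # degenerate split, e.g. all points identical
        return {"leaf": points}
    return {"pivots": (c1, c2), "left": build(g1, capacity), "right": build(g2, capacity)}
```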


4.4 Comparative Analysis of Indexing Methods 


4.4.1 Multidimensional space indexing methods 
4.4.1.1 Hashing methods 


Hashing is one of the most useful approaches in the field of multidimensional data indexing because of its ability to transform a data item into a low-dimensional representation (a short code composed of a few bits) [160]. The hashing methods are grouped into Locality Sensitive Hashing (LSH) methods and Learning to Hash (L2H) methods. Each group is further divided into centralized and distributed methods.


Table 4.1 presents a summary of the advantages and disadvantages of LSH methods. In centralized LSH methods, several hash tables are necessary to guarantee the quality of the search. However, due to the limited storage space and processing capacity of a single server, centralized indexing schemes become impractical for big data. Consequently, several distributed indexing schemes based on peer-to-peer (P2P) networks have been proposed, although ensuring load balancing remains one of the key issues. In addition to the question of quality guarantee, indexes have to be built for different radii. In this case, hundreds or thousands of hash tables will be built, resulting in high space consumption and high search cost.

Table 4.1: Summary of locality sensitive hashing methods in multidimensional space.

Category | Method | Advantages | Disadvantages
Centralized | C2LSH [143] | Reduces the need to have multiple hash tables [...] [265]. | The accuracy of C2LSH was still not high [144]. It is not scalable.
Centralized | PDA-LSH [13] | It can offer efficient support for both searches and updates. | The construction of the LMS-tree is very expensive in terms of compaction.
Centralized | PM-LSH [142] | Using the PM-tree improves query processing time. It uses an adjustable confidence interval to better use distance estimation and provide more accurate results [265]. | [...] distance between the points after obtaining the results of a query. High storage space consumed by the hash tables. The query search requires the computation of the hash function of the query in addition to the computation of probability on the candidate points.
Distributed | Near bucket-LSH 47 | It limits the searching process to the compartments to which the query is mapped. | Insufficient account for the load balancing distribution [...] P2P network.
Distributed | LFFIR 48 | Scalable content-based image retrieval. | Insufficient account for the load balancing problem, which is one of the key issues for the overall performance of the distributed system [150].
Distributed | DSLM 49 | It can achieve high retrieval rates and mobility resilience. | Insufficient account for the load balancing problem, which is one of the key issues for the overall performance of the distributed system.
Balanced and distributed | TSH [150] | It is a balanced distributed indexing scheme. | It is a static distributed indexing scheme.


Table 4.2 presents a summary of the advantages and disadvantages of learning to hash (L2H) methods. ODMVH and RDSH are unsupervised methods, so the learned hash codes suffer from limited semantics and discriminative capability. Further, they adopt simple fixed modality weights and binary projection mechanisms, which cannot adapt to the variations of streaming multimedia content or handle the modality-missing problem. In the context of the IoT environment, it is very difficult for distributed indexes to obtain labels for all the different IoT data. Furthermore, they are not suitable for a large and dynamic database, and the learning costs are very high.

Table 4.2: Summary of learning to hash methods in multidimensional space.

Category | Method | Advantages | Disadvantages
Centralized methods | ODMVH 55 | It can adaptively increase hash codes according to dynamic changes in the image. | ODMVH has limited performance because it is unsupervised and has not exploited any discriminative semantic information [267].
Centralized methods | RDSH 51 | It generates a very compact hash code. | It is not appropriate for a large and dynamic database.
Distributed methods | DISH 56 | [...] are distributed in a balanced way. | High time cost.
Distributed methods | SupDISH 57 | Effective compact hashing. Less memory consumption and calculation cost. | Difficulty in learning the binary codes.


LSH methods operate with predefined hash functions regardless of the underlying dataset, whereas L2H learns custom hash functions based on the dataset. Although an additional training step is necessary, some studies have shown experimentally that L2H outperforms LSH in terms of query search efficiency [158], [159], [160]. However, the Hamming distance is a coarse indicator of the similarity between the query and the elements of a bucket, as it is discrete and takes a limited number of values. Hamming ranking (HR) may not define a good order for buckets having the same Hamming distance from the query. As a consequence, HR generally probes a large number of unfavorable buckets, leading to low efficiency. A solution is to employ a long code so that the Hamming distance can classify buckets into more categories. However, long codes raise challenges such as sorting time, high storage demand and low scalability, in particular for large-scale datasets [161].
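The coarseness of Hamming ranking can be seen on a toy example. The Python sketch below, which assumes hash codes stored as plain integers, groups bucket codes by their Hamming distance from a query code: an m-bit code can only produce m + 1 distinct distances, so many buckets tie at the same rank and cannot be ordered among themselves.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")


def hamming_ranking(query_code, bucket_codes):
    """Group bucket codes by Hamming distance from the query, nearest group first."""
    groups = defaultdict(list)
    for code in bucket_codes:
        groups[hamming(query_code, code)].append(code)
    return dict(sorted(groups.items()))


# With 4-bit codes the 16 possible buckets fall into only 5 distance groups.
ranking = hamming_ranking(0b0000, range(16))
tie_sizes = {d: len(codes) for d, codes in ranking.items()}
```

Here six buckets share distance 2 from the query, and the ranking gives no reason to probe one before another. Lengthening the code adds distance levels, at the price of the storage and sorting costs noted above.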


4.4.1.2 Tree methods 


Centralized methods Table 4.3 presents a summary of the advantages and disadvantages of centralized tree-based indexing methods in multidimensional space. These methods are simple and easy to maintain. However, due to the limitation of single-machine resources, they cannot support the data generated by IoT devices, which are distributed in different regions and require highly concurrent access to big data. In addition, due to the considerable increase in the volume of data generated by IoT devices, all centralized methods suffer from a common drawback, namely the degradation of the efficiency of large-scale indexing structures. They are used for indexing a specific kind of data and are constructed for a predefined data dimension.


All of the discussed drawbacks show that centralized indexing in the multidimensional space is not capable of indexing a huge and growing volume of IoT data.


Distributed methods Table 4.4 presents a summary of the advantages and disadvantages of distributed tree-based indexing methods in multidimensional space. In distributed indexing methods, the data storage architecture, the data management models and the data processing methods are very different from those of a centralized system, so a centralized indexing structure cannot be easily transplanted into a distributed system. The distributed indexing methods face issues regarding the location of the data index, the method of accessing the data index and the method of retrieving the data after indexing. Despite the existence of efficient indexing methods in high dimension, each distributed method is used for indexing a specific type of data. For example, the S2R-tree [204] is constructed to index only spatial-temporal data, and the DAPR-tree [205] is built for indexing only geographical coordinate data.




Table 4.3: Summary of centralized tree indexing methods in multidimensional space.

Method | Advantages | Disadvantages
Kd-tree [162] | [...] | Costly and arbitrary. Performance limited by data dimension.
[...] [166] | Efficient search for point queries. | Insufficient search performance.
[...] 63 | Efficient storage and retrieval. | It is not balanced.
R-tree [12] | Dynamic and balanced structure. | It suffers from the problem of overlap. The time and the complexity of the computation increase in high dimension.
R*-tree [174] | Eliminates the overlapping rectangles. More efficient than the R-tree. | [...]
R*-tree [175] | Eliminates the overlapping rectangles. More efficient than the R-tree. | [...]
X-tree [177] | Reduces overlap rate. | It is very complex.
SR-tree [172] | Reduces overlap rate. | Very costly in insertion and search.
SUSHI-tree [173] | Reduces the dimensionality of the feature space. | The quality of clusters is not guaranteed.
TPR-tree [178] | It supports the queries for present and future positions of moving objects. | It is very difficult to manage the R-tree with the change of the number of moving objects, and the interval queries cover the whole tree.
TPR*-tree [179] | It minimizes the bounding rectangle for reducing query cost [180]. | It cannot handle historical queries.
D-tree [180] | Efficiently answers a wide range of queries. | Not appropriate for the growing number of moving objects.
BB-tree [...] | Balanced and dynamic structure. | Does not support the kNN search.
B-tree [167] | Balanced in insertion and deletion. Efficient for kNN and range search. | Large storage space is necessary. Maintenance is costly.
B+-tree [...] | Storage is minimized. | High complexity.
STCB-tree [183] | Efficient use of storage space. Reduces index maintenance. | It is not scalable to support a very high rate of updates.
B[...] [...] | Efficient processing for spatiotemporal data. | Not suitable for large data.
Table 4.4: Summary of distributed tree indexing methods in multidimensional space.

Method | Advantages | Disadvantages
[...] indexing framework [...] | Scalable and flexible index. | It is not well suited for cloud systems [185].
A-tree [186] | Scalable indexing scheme. Capable of handling both point and range queries. | Requires a large number of servers. Limited performance.
EMINC [185] | It provides fast query processing. Efficient index maintenance. | Overlapping in cube nodes of the R-tree.
CG-Index [189] | Efficient update and query performance. | Supports one-dimensional queries.
RT-CAN [191] | Supports point, range and kNN queries. | Not scalable regarding the dimensionality of the data.
[...] [194] | Dynamic and supports insertion and deletion operations. | [...]
T-HCN index [...] | Efficient in space and query search. | Overlapping in R-tree nodes during the publication.
RB-index [195] | Efficient and scalable. Supports point, range and kNN queries. | Overlapping in R-tree nodes during the publication.
[...]-tree [...] | Supports point, range and kNN queries. | High cost in the update and the maintenance of [...].
CR-index [198] | [...] indexing scheme. | Able to index data with a unique dimension [199].
CC-Index [200] | Simple structure. | Requires large storage space. Does not support updating indexes after the table has been built.
UQE-Index [201] | Supports a high insertion rate. Simultaneously provides an efficient multidimensional query. | Supports only range queries.
[...]CloudDM [203] | Able to index continuous IoT data. | Suffers from the latency problem in cloud computing.
Multi-attribute index [202] | Balanced structure. It can support range queries efficiently. | Only considers numerical data, which is one-dimensional.
S2R-tree [204] | [...] semantic information in the index. | The conversion of high-dimensional vectors into a low-dimensional space may lose the original semantics of the data.
DAPR-tree [205] | Balanced index. Efficient index for spatial data retrieval. | It is not dynamic. Overload when several requests are received.
Block grid index [13] | Supports indexing large-scale moving objects. | Requires an optimization in the incremental update of the kNN query search when the objects are moving.
ITTIS [14] | Suitable for processing temporal data in real time. | Extensive search time.
DPISCAN [207] | Efficient index for large-scale moving objects. | No query search method.
Geospatial data indexing [209] | Efficient search for range queries for geospatial data. | Dimension limited to three.
Lossless compression data indexing in IoT [211] | Enhancement of lossless compression. | Temporal alignment and deduplication of IoT streaming data not addressed.
Textual and spatial objects indexing [212] | Minimization of the search space by the use of a pruning technique. | Difficult update in the IoT environment.
SSKQR* [213] | Efficient index to retrieve the most relevant data. | Cost of construction and maintenance.
Data lakes approach [215] | Efficient storage for huge amounts of data. | High maintenance cost.
aN a -Scalability in both computing and storage senses. | -Costly data load time. 
Hierarchical multidimensional indexing [216] | -Efficient spatial query processing. | -Limited dimension.

4.4.2 Metric space indexing methods


Centralized methods. Table 4.5 presents a summary of the advantages and
disadvantages of centralized tree-based indexing methods in metric space. These
methods are based on the successive division of the space into subspaces. They
face a rapid growth in the number of regions and subspaces as data keeps
arriving, which leads to the degeneration of the index. Another issue is the
overlap between these subspaces, which is not resolved efficiently. As the
volume of data generated by IoT devices has increased considerably, traditional
centralized indexing methods have become ineffective: their limited processing
capacity reduces the overall performance of query-based search.


Distributed methods. Table 4.6 presents a summary of the advantages and
disadvantages of distributed tree-based indexing methods in metric space.
Distributed methods in metric space are able to index any type of data.
Distributing an index over several local indexes makes big data indexing
possible. Nevertheless, the question remains how to distribute the indexes and
how to retrieve the data from these distributed indexes.
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Table 4.5: Summary of centralized tree indexing methods in metric space. 


Category Advantages Disadvantages 
-Static structure. 
BS-tree -Simple partitioning. -High computational costs. 
[217 -Reduce overlap rate. -High cost search. 
-Support only range search. 
eh MBS-tree -Reduce the cost of search compared -Static structure. 
a [218 with the BS-tree. -Support only range search. 
iS} -Static structure. 
2 V-tree -Reduce overlap rate. 
oo ‘ . : -Unbalanced structure. 
B [219 -Simple implementation ’ : ‘ : ; 
a -Reinsertion objects is largely costly. 
g GH-tree -Reduce overlap rate. -Static structure. 
a [220 -Simple implementation -High cost search. 
3 GNAT Novoverlapping -Static structure. 
a [222 Pping- -High computational costs. 
EGNAT : en : : : 
[223 -Needs lower CPU time than the GNAT tree. -Difficulty in balancing the index. 
D-tree -Dynamic structure. 
¥ oe Ne Bice ac -Unbalanced structure. 
3 [225 -Reduce overlapping rate. 
g VP-tree : . ; = ; ' : 
= [229 -Simple implementation. -High cost in terms of computed distance and time. 
as mVP-tree -Static structure. 
=| -Reduces research costs- 
75 [231 -Support only range search. 
5h) a? MM-tr -No overlapping regions. 
=a ae eed name ant meal -Unbalanced structure. 
| 3 [232 -Dynamic structure. 
A) Onion-tree -Improved space partitioning as compared to MM-tree. : war . 
oft ; ; -Insertion of objects creates a semi-balance. 
g| € [233 -Dynamic structure. 
Ou | IM-tree id “ : ee 
sealer [221 -Efficient when comparing to MM-tree and Slim-tree. -Index degeneration in large-scale data. 
faa) 
XM-tre ; ‘ 
(234) -Fast kNN search. -High memory requirements. 
n wn 
zis NOBH-tree : ae : ‘ ‘ 
ef) 9 se ree | -No overlapping of the divided data space -High cost of insertion and search. 
S/S é 
=a) ¢ 20 D-index -Reduction of distance calculations. m Lota tad < 2 
a | a S i ae -The mapped of points deformed the distances. 
ben re _ [236 -No overlapping of the divided data space. 
2 |S g eD-index F eyo fa we ie -Efficient for small query radii only. 
3 | 5 -Suitable for similarity self join. a i a y ‘ y ; 
Se 5 [238 -The mapped of points deformed the distances. 
oy | acs a aa ; : ; -kNN search fi 1 using -dimensio: 
81d a iDistance -Reduction of distance calculations. BUN ae Gangs gee ganar Un peer eee ve 
s = : aa range search. 
oa 2 [239 -No overlapping of the divided data space. ne ee ‘ é 
iS aa -The mapped of points deformed the distances. 
Bp ) F -Reduction of distance calculations. 
@ z| M-index one A ioe an aoa ? 
s a. [240 -No overlapping of the divided data space. -The mapped of points deformed the distances. 
=z -Efficient search in comparison with the iDistance. 
ian SPB-tree -Reduce the cost in terms of storage iy cree . 
: -The mapped of points deformed the distances. 
[241 construction and search. 
M-tree : -Not scalable for high volumes of data. 
¢ -Dynamic and balanced structure. 
[243 -High cost search. 
BHDNUNCE =o) yneenio sihuchir: Degradation performance in processing a query 
[226 -Reduce overlap rate. o a : PIRI V ae ered 
DBM-tree -Dynamic structure. Hepansise Constrosish 
[245 -Reduce overlap rate. PSs : : 
DBM*-tree | -Reduce the cost of construction. Nicaea hetero erate 
2 [246 -Dynamic structure. par e ie ca 
g & M*-tree -Low cost of index. Hish coat oF sebich 
Qs [247 -Dynamic structure. & 7 7 1 
= 
x -Dynamic structure. 
PM-tree -Reduce the distance calculations. ‘ : , : 
: : ; -High computational costs of construction. 
[248]. -Compact indexes increasing the performance 
of the similarity search. 
Super M-Tree : pa , : : 
[249 -Able to address approximate requests for subsequences. | -High computational costs of construction. 
Hollow-tree ont Rah ae, xe — : 
[250 -Able of handling missing data. -Not support high volumes of data. 
-Static structure. 
AESA ’ ; 3 -High computational costs of construction. 
ane -Simple implementation. i : 
3 [251 -High cost of storage. 
3 g Not support large data sets. 
as -Static structure. 
6s LAESA : ; 
© 4 [253 -Reduce the construction cost compared with EASA.- Not support large data sets. 
& ae -High cost of storage. 
I-LAESA -Reduce the distance calculations in query search -High cost of storage. 
[256 compared with LAESA. Not support large data sets. 
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Table 4.6: Summary of distributed tree based indexing methods in metric space. 


Category Advantages Disadvantages 
GHT* -Parallelism speed up the search. -Support only range search. 
[258 -No overlapping data distribution between servers. | -Unstable connections between nodes of p2p network 
VP-tree -Parallelism speed up the search. -Support only range search. 
[258 -No overlapping data distribution between servers. | -Unstable connections between nodes of p2p network 
a M-Chord -Effective split of local metric based indexes. END and Pee ACEY. ncarck: pevionacdivsine 
3 ie eae one-dimension. 
5 [259 -Efficient similarity search. F 
Sle -Unstable connections between nodes of p2p network. 
z 3 M-CAN -Similarity query search performed using one-dimension. 
20 3 [261 -Effective split of local metric based indexes -Unstable connections between nodes of p2p network. 
% I ml i -Support only range search. 
Die -Balanced structure. 
= 2 aes -No overlapping data distribution between -Difficult to balance the index. 
= 2 nodes of p2p network. 
a | ADMS tips F ; ‘ 
2 a [263 -Balanced distributed system. -Risk of network saturation during message exchange. 
o . : jl 
a Sed Mies -Support high volumes of data. -Inefficient kNN search. 
a ae oe ak -Unbalanced index in terms of load of fog layer nodes. 
BCCF-tree -Efficient kNN search. -Costly building process. 
[5] -Balanced partitioning of the data. -Degradation in large scale. 


4.5 Conclusion 


In this chapter, a review of the literature on big IoT data indexing was
presented. A new taxonomy of indexing techniques, in both multidimensional and
metric spaces, was proposed based on their grouping into centralized and
distributed methods. For all indexing methods, in both multidimensional space
and metric space, a comparative analysis was carried out by pointing out the
advantages and disadvantages of each index. Indexing methods in metric space
perform better than those in multidimensional space, which was expected since,
in metric space, data objects are defined by distances. On the other hand, the
few distributed indexing methods in metric space are more efficient than the
centralized indexing methods in the same space. In the next part, we propose
distributed indexing methods, developed in this thesis in metric space, in
order to solve some of the issues raised previously. The similarity query
search performance of these indexes is tested using the kNN search method.

5.1 Introduction


After studying the state of the art of centralized indexes in multidimensional
and metric spaces, we find that they share a common drawback: their efficiency
degrades at large scale, which makes them inefficient for indexing IoT data.
This inefficiency calls for index distribution to keep the query search process
fast. Most distributed indexes in multidimensional and metric spaces are stored
in the cloud [194], [193], [195], [197], which poses several challenges [22]:

• High or unpredictable latency, due to the long distances between users and
the cloud.

• High uplink bandwidth requirements: gateways that lack the bandwidth to
upload certain types of sensor data to the cloud cannot use the cloud-based
storage and processing approach.

• No in-network filtering or aggregation: some applications cover a large
geographical space, whereas only an aggregate value of the sensors is actually
important.

• An uninterrupted internet connection is required.

Indexes in the multidimensional space are more robust because they depend
strongly on the data type or, more precisely, on its geometric properties.
This dependency, however, ties each index to a specific data type, which makes
indexing the various types of IoT data very difficult [39].

To address these challenges, we propose to relocate the indexing process from
the cloud to the fog nodes in order to bring the data as close as possible to
the indexing structure and thereby considerably reduce network congestion. In
addition, each fog node generates its own indexing structure, which allows
parallelism not only during the construction of the trees but also in the
similarity query search, since the same query is launched simultaneously on all
fog nodes. In each fog node, a clustering method is used as a pre-indexing
step. The density-based spatial clustering of applications with noise (DBSCAN)
algorithm partitions the data into homogeneous groups, which are indexed in
parallel. This process promotes the creation of balanced trees with a minimum
degree of overlap between the leaves of each tree. Indeed, DBSCAN is a
density-based clustering method that stands out for its ability to
automatically create clusters with almost zero inter-class similarity. To avoid
the limitations of indexing structures in multidimensional space, we chose the
metric space, which has been found to be very important for building effective
indexes for similarity searching. The metric space seems to be the right
compromise since only the distances between data are used, regardless of their
types and dimensions. In addition, we use tree indexing structures, which are
dynamic structures that adapt to data changes. The complexity of insertion and
search in a tree structure is logarithmic, meaning that the search time grows
only logarithmically with the number of indexed objects.


In this chapter, we propose a new system for indexing and retrieving data in an
IoT environment. The so-called Binary tree based on containers at the
cloud-clusters fog computing level (B3CF-tree) deals with index degradation and
network congestion while ensuring minimal kNN search time with optimal result
quality, by introducing clustering with the DBSCAN algorithm as a step before
indexing. The clustering process, in turn, allows parallelism to be introduced
in the indexing of the separated clusters and in the similarity query search
using the kNN search method. The proposed approach is presented in detail,
followed by the simulation and results of the index construction in terms of
the number of calculated distances, the number of comparisons, the construction
time and the index quality. The parallel kNN similarity query search is also
evaluated by the number of comparisons, the number of calculated distances, the
search time and the number of visited leaves. The performance of the proposed
index is examined by comparison with some existing indexes, namely the
BCCF-tree [5], IWC-tree [5], MX-tree [247] and BB-tree [176], [201].


5.2 Proposed Approach 


Similarity query search in an IoT environment is very complex due to the
exponential increase in data, which needs to be organized. In this approach,
the collected data is grouped into clusters, using a clustering algorithm,
before being indexed in parallel. Parallelism is an efficient tool to speed up
both the index construction and the search algorithm. As in the BCCF-tree [5],
the system architecture consists of three layers: the IoT sensor layer (or
terminal layer), the fog layer, and the cloud layer (Figure 5.1). The terminal
layer sends the data generated by the interconnected devices to the fog layer.
The fog nodes are close to the terminal devices and are able to compute and
store the data [268]. In this approach, the fog layer is divided into two
levels (Figure 5.1). In the first fog level, the data sent by the terminal
layer is collected and aggregated. In the second fog level, the data from each
cluster is indexed and the trees are constructed. The leaves of the constructed
trees are stored in the cloud layer.

[Figure: cloud layer; fog layer (Level 2, Level 1); terminal layer]

Figure 5.1: Cloud-fog computing architecture.


5.2.1 Clustering fog level 


The purpose of the first fog level is to segment the data gathered from IoT
devices. This makes it possible to construct the trees of the different
clusters in parallel and speeds up the search, since each tree contains only
similar objects, represented by its root. Indeed, it is not necessary to go
through an entire tree to answer a query; in addition, the query can be
launched on all fog trees at the same time. To this end, the DBSCAN
(Density-Based Spatial Clustering of Applications with Noise) algorithm
(Algorithm 1) was chosen to segment the data into clusters. DBSCAN is a
density-based clustering algorithm designed to discover clusters of arbitrary
shape. Its main idea is that, for each object in a cluster, the neighborhood of
a given radius must contain at least a minimum number of objects.




CHAPTER. 5 Parallel Construction of B3CF-trees 


Table 5.1: Definitions of variables used in algorithms. 


Symbols | Definitions
$O$ | Set of objects $O = \{o_1, \ldots, o_{n_o}\}$
$n_o$ | Number of objects in $O$
$C$ | Set of clusters $C = \{C_1, \ldots, C_{n_c}\}$
$n_c$ | Number of clusters
$C_c$ | Center of a cluster
$d(a, b)$ | Distance function between objects $a$ and $b$
$n_{oc}$ | Number of objects in each cluster


In our context, DBSCAN seems to be the best choice since it determines the
number of clusters automatically, whereas other clustering methods, such as
k-means and spectral clustering, require the number of clusters as input, which
is not always easy to determine when dealing with metric and multimodal data.


The basic version of DBSCAN only groups similar elements, without determining a
representative for each cluster. This missing information is very important
both when constructing the tree and when searching for an element in it.
Indeed, the representative of a cluster is an optimal choice for the root of
the tree, given that a good root optimizes both construction and search. In
addition, the representative reduces the number of comparisons needed to select
the tree concerned by the search. In practice, without a representative, a new
object is assigned to a class by computing its distance to all the elements of
each cluster. Creating a representative reduces the number of comparisons per
cluster to one. For this reason, a new version of DBSCAN is proposed
(Algorithm 1) to take these requirements into account. For readability, the
definitions of all variables are grouped in Table 5.1. In Algorithm 1, the
dataset $O$, consisting of $n_o$ objects $o$, is clustered into $n_c$ clusters
$C$ with centers $C_c$ using DBSCAN with the chosen parameters Eps and MinPts.


[Figure: clustering fog level; B3CF-tree; a zoom of the B3CF-tree]

Figure 5.2: B3CF-tree construction in the cloud-fog computing level.




Algorithm 1 DBSCAN modified with cluster centers.
Require: O = {o_1, ..., o_{n_o}}, Eps, MinPts
Ensure: C, C_c
  ClusterId = nextId(NOISE)
  for i ∈ {1..n_o} do
    Point = O.get(i)
    if Point.ClId = UNCLASSIFIED then
      if ExpandCluster(O, Point, ClusterId, Eps, MinPts) then
        ClusterId = nextId(ClusterId)
      end if
    end if
  end for

  for i ∈ {1..n_c} do
    compute the center C_c of cluster C_i
  end for
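The steps of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names are hypothetical, the Euclidean distance stands in for an arbitrary metric d, and, since Algorithm 1 does not state how the center is computed, the sketch assumes that the center of a cluster is its medoid (the member with the smallest total distance to the other members), which only requires distances.

```python
from math import dist  # Euclidean distance, used only as an example metric


def region_query(objects, i, eps, d):
    """Indices of all objects within distance eps of objects[i] (itself included)."""
    return [j for j, o in enumerate(objects) if d(objects[i], o) <= eps]


def dbscan_with_centers(objects, eps, min_pts, d=dist):
    """Cluster `objects` with DBSCAN, then pick one representative per cluster.

    Returns (labels, centers): labels[i] is the cluster id of objects[i]
    (-1 for noise) and centers[c] is the index of the representative of cluster c.
    """
    UNCLASSIFIED, NOISE = None, -1
    labels = [UNCLASSIFIED] * len(objects)
    cluster_id = 0
    for i in range(len(objects)):
        if labels[i] is not UNCLASSIFIED:
            continue
        seeds = region_query(objects, i, eps, d)
        if len(seeds) < min_pts:           # not a core object
            labels[i] = NOISE
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:                       # ExpandCluster step
            j = queue.pop()
            if labels[j] == NOISE:         # border object reached from a core object
                labels[j] = cluster_id
            if labels[j] is not UNCLASSIFIED:
                continue
            labels[j] = cluster_id
            neighbours = region_query(objects, j, eps, d)
            if len(neighbours) >= min_pts:  # j is a core object: keep expanding
                queue.extend(neighbours)
        cluster_id += 1

    # Assumption (not stated in Algorithm 1): the center C_c is the medoid.
    centers = {}
    for c in range(cluster_id):
        members = [i for i, lab in enumerate(labels) if lab == c]
        centers[c] = min(
            members, key=lambda i: sum(d(objects[i], objects[j]) for j in members)
        )
    return labels, centers
```

With such representatives, a new object can be routed to a cluster with one distance computation per cluster, by comparing it with each center only.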


5.2.2 Indexing fog level 


After clustering has been done at the first level of the fog, each cluster is
indexed in parallel. The fundamental objective is to allow the indexes of the
data clusters to be built and queried independently and simultaneously. The
aim is to create dense clusters of small size, in order to improve the
execution time of the construction and search algorithms compared with our
previous proposal [5] and with the most recent existing techniques. Since each
tree contains only similar objects, the volume of the indexed space is limited
and empty sets are excluded, that is, separable partitions that contain no
objects and create extended regions that would be inserted into a new index.
This problem is known in the field as the curse of dimensionality. The
distribution of data has to be almost balanced between all fog nodes. The
B3CF-tree (Figure 5.2), a Binary tree based on Containers at the Cloud-Clusters
Fog computing level, is strongly inspired by the BCCF-tree [5] and the GHB-tree
[262], and tries to improve the performance of their construction and search
algorithms. Space partitioning is a technique that leads to simpler data
structures and thus simpler algorithms. Moreover, the exponential growth of
volumes in large spaces argues for techniques that reduce, or at least limit,
these volumes and even control their occupation; this is guaranteed here by the
DBSCAN clustering algorithm. The B3CF-tree is based on partitioning each
cluster, in the metric space, into two regions using two balls at a time.


For the construction of the balls, we choose two objects and consider them as
two pivots (Figure 5.2). The distance between these two pivots is also the
radius of the two balls.

A B3CF-tree node, denoted $N$, is defined as follows:

• A leaf node $L$ is a set of indexed objects $E_L \subseteq E$ where
$|E_L| \le c_{max}$.

• An internal node $N$ is a septuple
$(p_1, p_2, r, r_1, r_2, N_1, N_2) \in O^2 \times \mathbb{R}^3 \times \mathcal{N}^2$,
where:

  - $r = d(p_1, p_2)$ defines two balls $B_1(p_1, r)$ and $B_2(p_2, r)$
    (Figure 5.3), centered on $p_1$ and $p_2$ respectively and sharing a common
    radius that is large enough for the two balls to have a nonempty
    intersection.

  - $r_1$ and $r_2$ are the distances from $p_1$ and $p_2$, respectively, to
    the farthest object in the subtree rooted at node $N$.

  - $N_1$ and $N_2$ are two subtrees (Figure 5.3) such that
    $N_1 = \{o \in N : d(p_1, o) \le d(p_2, o)\}$ and
    $N_2 = \{o \in N : d(p_2, o) < d(p_1, o)\}$.

Figure 5.3: Partitioning the space in the B3CF-tree.
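As a worked illustration of the internal-node definition, the following Python sketch computes the components of a node from a set of objects and two given pivots. The function name is hypothetical and the Euclidean distance is only an example metric. The sketch also assumes that $r_1$ and $r_2$ are measured over the objects assigned to $p_1$ and $p_2$ respectively, which is one possible reading of the definition.

```python
from math import dist  # example metric; any metric d(a, b) could be used instead


def split_by_pivots(objects, p1, p2, d=dist):
    """Partition `objects` around two pivots as in the internal-node definition.

    Returns (r, r1, r2, n1, n2): r is the common ball radius d(p1, p2),
    n1 holds the objects with d(p1, o) <= d(p2, o), n2 holds the others.
    Assumption: r1 / r2 are taken over n1 / n2 (farthest object from each pivot
    inside its own partition).
    """
    r = d(p1, p2)
    n1 = [o for o in objects if d(p1, o) <= d(p2, o)]
    n2 = [o for o in objects if d(p2, o) < d(p1, o)]
    r1 = max((d(p1, o) for o in n1), default=0.0)
    r2 = max((d(p2, o) for o in n2), default=0.0)
    return r, r1, r2, n1, n2


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (4, 0), (5, 0)]
    print(split_by_pivots(points, p1=(0, 0), p2=(5, 0)))
```

In this small example the two pivots are the objects farthest apart, so every object falls into exactly one of the two partitions and the partitions do not overlap.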


5.2.2.1 B3CF-tree build 


The construction of the B3CF-tree is an incremental process. Algorithm 2
presents a formal description of the parallel index construction process.

Algorithm 2 Parallel B3CF-tree build ($C_i$, $n_c$)

$$
\mathrm{BuildB3CF}(S \in \mathcal{P}(O)) \in \mathcal{N} =
\begin{cases}
\perp & \text{if } S = \emptyset \\
(e, \perp, \perp) & \text{if } S = \{e\} \\
\big(p_1, p_2,\ \mathrm{BuildB3CF}(\{e \in S : d(p_1, e) \le d(p_2, e)\} \setminus \{p_1\}),\
\mathrm{BuildB3CF}(\{e \in S : d(p_2, e) < d(p_1, e)\} \setminus \{p_2\})\big) & \text{else}
\end{cases}
$$

with $(p_1, p_2)$ = the two farthest pivots.


Objects are inserted from top to bottom (Algorithm 3). Initially, the tree is
empty (a single leaf encompasses a cluster that contains a set of objects). The
farthest two-pivot search algorithm is applied to all objects: choosing the two
elements farthest apart from each other is one of the strategies we adopted to
try to balance the tree. The container is then divided into two non-overlapping
subsets so that each element of the container is assigned to its nearest pivot.
Finally, the leaf is replaced by an internal node with pivots $p_1$ and $p_2$,
and two leaf nodes are created (Figure 5.3).

The data collected at this level of the fog is aggregated into clusters, using
the DBSCAN clustering algorithm, before being indexed in parallel with
B3CF-trees. Parallelism is an efficient tool to speed up the index construction
as well as the search algorithm.
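The top-down insertion with container splitting, and the parallel construction of one tree per cluster, can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the implementation evaluated in this chapter: the class and function names are hypothetical, the two farthest objects are found by an exhaustive pairwise search inside the overflowing container, the pivots are kept in the leaves together with the other objects, and the covering radii $r_1$ and $r_2$ are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from itertools import combinations
from math import dist  # example metric; any metric d(a, b) could be used instead


@dataclass
class Leaf:
    objects: list = field(default_factory=list)  # container, at most c_max objects


@dataclass
class Internal:
    p1: tuple
    p2: tuple
    r: float       # common ball radius d(p1, p2)
    left: object   # subtree of objects closer to p1 (ties go left)
    right: object  # subtree of objects strictly closer to p2


def insert(node, o, c_max, d=dist):
    """Insert `o` top-down and return the (possibly new) subtree root; c_max >= 1."""
    if isinstance(node, Leaf):
        node.objects.append(o)
        if len(node.objects) <= c_max:
            return node
        # Overflow: the two objects farthest apart become the pivots.
        p1, p2 = max(combinations(node.objects, 2), key=lambda pair: d(*pair))
        if d(p1, p2) == 0:          # all objects coincide, nothing to separate
            return node
        new = Internal(p1, p2, d(p1, p2), Leaf(), Leaf())
        for x in node.objects:      # each object goes to its nearest pivot
            insert(new, x, c_max, d)
        return new
    if d(node.p1, o) <= d(node.p2, o):
        node.left = insert(node.left, o, c_max, d)
    else:
        node.right = insert(node.right, o, c_max, d)
    return node


def build_tree(cluster, c_max, d=dist):
    """Build the tree of one cluster by successive insertions."""
    root = Leaf()
    for o in cluster:
        root = insert(root, o, c_max, d)
    return root


def build_forest(clusters, c_max, d=dist):
    """Build one tree per cluster, each cluster handled by its own thread."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: build_tree(c, c_max, d), clusters))
```

Because the clusters are disjoint, the trees share no state and can be built independently, which is what makes the per-cluster parallel construction straightforward.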


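Algorithm 3 Insertion in B3CF-tree

$$
\mathrm{Insert\text{-}B3CF\text{-}tree}\big(o \in O,\ d : O \times O \to \mathbb{R}^+,\
c_{max} \in \mathbb{N}^*,\ N \in \mathcal{N}\big) \in \mathcal{N} =
\begin{cases}
(o, \perp, \perp) & \text{if } N = \perp \\
(p_1, o, \perp, \perp) & \text{if } N = (p_1, \perp, \perp) \\
(p_1, p_2, \mathrm{Insert}(o, d, c_{max}, N_1), N_2) & \text{if } N = (p_1, p_2, r, r_1, r_2, N_1, N_2)
\wedge d(p_1, o) < r \wedge d(p_2, o) < r \\
(p_1, p_2, N_1, \mathrm{Insert}(o, d, c_{max}, N_2)) & \text{if } N = (p_1, p_2, r, r_1, r_2, N_1, N_2)
\wedge d(p_1, o) < r \wedge d(p_2, o) > r
\end{cases}
$$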


The complexity of the B3CF-tree construction is calculated as follows. Applying
the DBSCAN algorithm to a dataset of size $n$ results in clusters of different
sizes $n_{oc} < n$. Since the index construction is performed in parallel using
Algorithm 2, its complexity can be considered as $O(m \log m)$, where $m$ is
the average size of the resulting clusters. Moreover, the complexity of DBSCAN
is $O(n \cdot d)$, where $d$ is the average number of neighbors, while the
original DBSCAN has $O(n)$ memory complexity [119], [269]. Thus the overall
complexity of our approach is $O(n \cdot d) + O(m \log m)$. Similar to the
BCCF-tree [39], the construction follows a balanced hierarchical partitioning
of a set of data clusters. The volume of the regions becomes smaller, which
automatically leads to a lower overlap rate, which in turn improves the
performance of the search algorithms.

5.2.2.2 Parallel kNN search in B3CF-tree


As in the indexing process, parallelism is also used in the similarity query
search to minimize retrieval time. The formal description of the kNN search in
the B3CF-tree is summarized in Algorithm 4. The aim of the k-nearest neighbor
search is to find the set $A$ of the $k$ objects closest to a query point $q$.
The kNN search algorithm starts with a query radius $r_q$ initialized to
$+\infty$, which would lead to scanning the whole dataset; the radius then
decreases while each tree is traversed, taking the value of the distance to the
$k$-th object in the ordered list $A$. Comparing the distances $d_1$ and $d_2$
between the query point $q$ and the two pivots $p_1$ and $p_2$, respectively,
with $r_q$ indicates how the query descends in the index. The leaf nodes
contain a subset of the indexed data with maximum cardinality $c_{max}$. To
find the $k$ nearest neighbors in a leaf, we simply sort the indexed data by
increasing distance to the query $q$. As a result of the search, the first $k$
sorted objects are returned.


Because the kNN search is parallelized over all B3CF-trees, its complexity can
be reduced to that of a search in a single B3CF-tree, which is of the order of
$O(\alpha \sqrt{m} \log k) + \log(m) / (\alpha k \sqrt{m})$, where
$m = \max(n_{oc})$ is the maximum size of the resulting DBSCAN clusters and
$\alpha = n_{ov} / \sqrt{m}$ is the ratio of the number of objects in all
visited leaves, $n_{ov}$, to the maximum cardinality $c_{max} = \sqrt{m}$. The
first term, $O(\alpha \sqrt{m} \log k)$, is the order of the complexity of the
computations performed in the leaves, while the second term,
$\log(m) / (\alpha k \sqrt{m})$, estimates the complexity of the computations
performed when traversing the index from the root.


Our proposed approach may have an important impact on IoT data processing due
to the closeness of the fog layer to the end user. Any set of heterogeneous IoT
data can be indexed using our proposed method because, first, it is developed
in metric space and, second, DBSCAN separates the data into clusters of
homogeneous content that are indexed in parallel. The use of parallelism during
data indexing and kNN query search speeds up both the index construction and
the similarity search process.


Algorithm 4 Search-kNN in B3CF-tree 


NEN, 
gE R’, 
k EN’, Ys re 
kNN-B3CF d:0x0 Rt, €(R 0) 
rg € Rt = +00, 
Ae (R+x 0)N =O 
with : 
e A = ((di, 01), (da, 02), . Siti (dy, Oxr)) 
ed, = d(p1,q) ; 
ed y= d(po, q) ) 


e Ci = B(g,rg) N B(pi,r) #9, for the intersection ; 


e Cy = B(grg) A B(pi,r) 4 OA Bla, rg) N Blp2,r) 4 O, for the partial ball 
centered on py ; 


e C3 = B(g,rq) 1 Blpi,r) F OA Bla, rg) N B(p2,r) FY, for the partial ball 
centered on pz ; 


ay ew 
— Co = true ; 

Teg = min 7d} if ko = Keelse ry ; 

— A; = kNN-B8CF(M,, ¢, k, rq,_,, Asi) if Ci else Aj_1 ; 

— rg, = min{r,, ,,de} if |A;a| = & A Aja = ((d1,01),---, (de, 0%)) else 


Vai: 
AJ A,ksort(AU {(d(o,¢),0):0€ L}) ifN=L 
Ag if N = (pi, p2,r, Ni, No) 
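The principle of the parallel kNN search (a query radius that starts at $+\infty$ and shrinks to the distance of the current $k$-th neighbor, one search per tree, then a merge of the partial answers) can be sketched in Python as follows. This sketch does not reproduce the conditions of Algorithm 4: as a stand-in, it prunes a subtree with the standard generalized-hyperplane bound $|d_1 - d_2| / 2 > r_q$, which follows from the triangle inequality. All names are hypothetical, the Euclidean distance is only an example metric, and the small recursive builder is included solely to make the example self-contained.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, count
from math import dist, inf


def build(objects, c_max, d=dist):
    """Toy builder: a leaf is a list, an internal node is (p1, p2, left, right)."""
    if len(objects) <= c_max:
        return list(objects)
    p1, p2 = max(combinations(objects, 2), key=lambda pair: d(*pair))
    if d(p1, p2) == 0:                      # identical objects cannot be separated
        return list(objects)
    left = [o for o in objects if d(p1, o) <= d(p2, o)]
    right = [o for o in objects if d(p2, o) < d(p1, o)]
    return (p1, p2, build(left, c_max, d), build(right, c_max, d))


def knn(tree, q, k, d=dist):
    """Return up to k (distance, object) pairs nearest to q, sorted by distance."""
    best = []                               # max-heap of (-distance, tie, object)
    tie = count()

    def radius():                           # current query radius r_q
        return -best[0][0] if len(best) == k else inf

    def visit(node):
        if isinstance(node, list):          # leaf: scan the container
            for o in node:
                dq = d(q, o)
                if dq < radius():
                    item = (-dq, next(tie), o)
                    if len(best) == k:
                        heapq.heapreplace(best, item)
                    else:
                        heapq.heappush(best, item)
            return
        p1, p2, left, right = node
        d1, d2 = d(q, p1), d(q, p2)
        near, far = (left, right) if d1 <= d2 else (right, left)
        visit(near)
        # Assumption: generalized-hyperplane pruning, not the C1..C3 tests.
        if abs(d1 - d2) / 2 <= radius():
            visit(far)

    visit(tree)
    return sorted(((-nd, o) for nd, _, o in best), key=lambda pair: pair[0])


def parallel_knn(trees, q, k, d=dist):
    """Run the same query on every tree in parallel and merge the partial answers."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda t: knn(t, q, k, d), trees)
        merged = [pair for result in partial for pair in result]
    return heapq.nsmallest(k, merged, key=lambda pair: pair[0])
```

Each tree returns at most $k$ candidates, so the final merge only has to select the $k$ smallest distances among at most $k$ times the number of trees.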
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5.3. Simulation and Results 


To test and compare the effectiveness of the proposed approach, several
experiments were performed on five real datasets of different sizes and
dimensions (Table 5.2). The size, or number of vectors, is the number of rows
in the database, while the dimension is the number of values in each row
(vector coordinates). The databases were carefully selected to bring together
most of the problems encountered in indexing IoT data.


Table 5.2: Characteristics of the selected datasets for the index evaluation. 


Dataset | Size (Vectors) | Dimension
Geographical coordinates | 988 | 2
GPS trajectory | 18107 | 3
Tracking a moving object dataset | 62702 | 20
WARD (Wearable Action Recognition Database) | 1000000 | 5
Smart Home data | 5000000 | 4

1. Geographical coordinates: a real dataset of 988 2D vectors, which has a low
dimension. It contains BD-L-TC topographic data of selected locations and
places [270].

2. GPS trajectory: a dataset of 18107 3D vectors, containing transport
trajectories in the northeast of Brazil [271].

3. Tracking of a moving object: a real dataset of 62702 20D vectors. It
represents the results of a random simulation of tracking a moving object using
wireless cameras.

4. WARD (Wearable Action Recognition Database) [272]: a real dataset of 1000000
5D vectors. It is a reference database for human activity recognition using
wearable sensors [273].

5. Smart Home data [274]: a real dataset of 5000000 4D vectors. The dataset is
composed of data from IoT sensors based on the MQTT communication protocol,
where the scenario is related to a smart home environment [275].


The experiments were implemented in the Python programming language and run on
an Intel(R) Core(TM) i7-8550U CPU (1.80 GHz × 8) with a 64-bit Linux operating
system (Ubuntu). The DBSCAN parameters Eps and MinPts used for each dataset are
grouped in Table 5.3. In the implementation, the machine used is considered as
a fog node in which the received data is processed in two steps.


Table 5.3: Parameter values of the DBSCAN algorithm. 
Dataset | Eps | MinPts
Geographical coordinates | 0.062 | 38
GPS trajectory | 70 | 3
Tracking a moving object dataset | 248 | 250
WARD (Wearable Action Recognition Database) | 91 | 23
Smart Home data | 170 | 30


In the first step, for data indexing, two codes were implemented: DBSCAN
clustering (Algorithm 1) and the parallel build of the B3CF-trees using threads
(Algorithms 2 and 3). In the second step, the parallel kNN query search is
implemented, in threads, using the code of Algorithm 4. The effectiveness of
the proposed B3CF-tree construction and of the query response is tested by
comparing our results with those obtained by the following index structures:


• BCCF-tree (Binary tree based on containers at the cloud-fog computing level)
[5]: the B3CF-tree index proposed in this study is an improvement of this
index, since there is no overlapping of objects when the DBSCAN algorithm is
used for clustering in the metric space.

• IWC-tree (Indexing tree without containers) [5]: comparing our results with
those of this index shows the effectiveness of using containers in binary
trees.

• MX-tree [247]: comparing our results with those of this index highlights the
difference between hyper-plane partitioning and ball partitioning in the metric
space.

• BB-tree (Bubble Buckets tree) [176], [201]: this index is constructed in the
multidimensional space; comparing it with our proposed index shows the
difference between the metric space and the multidimensional space.


5.3.1 Evaluation and comparison of the index construction 


The evaluation of the index construction of the B3CF-tree is based on the number of computed distances, the number of comparisons, and the construction time (Figure 5.4), where the size of the containers is set to $c_{max} = \sqrt{n}$. From the results presented in Table 5.4, one can see that the IWC-tree has no results for the Smart Home data, which reflects the degradation of this index when the data size exceeds five million objects. This is because the IWC-tree processes the whole dataset, unlike the B3CF-tree and the BCCF-tree, which process partial data by means of containers. In the BB-tree index, the balls also act as containers. In the MX-tree, each node has a maximum capacity beyond which it is divided into two nodes.


5.3.1.1 Number of calculated distances 


As shown in Figure 5.4, the number of distances computed during the construction of all index structures changes with the size and dimension of the datasets. The number of distances computed when constructing the proposed B3CF-tree (taken as the sum of the number of distances for all clusters) is less than that of the BCCF-tree, the MX-tree, and the IWC-tree. In the BCCF-tree, k-means is used to determine the two pivots during the index construction, which increases the number of distances. In the proposed B3CF-tree, the two pivots are always chosen as the most distant objects. The number of distances calculated during the construction of the B3CF-tree compares well with the other methods, except for the BB-tree, because the BB-tree is constructed in the multidimensional space, where the data are directly partitioned without calculating distances. The construction of the B3CF-tree is efficient thanks to the DBSCAN algorithm, which groups the data well: the clusters have the same density and the objects within them are similar.

Figure 5.4: Number of distances, number of comparisons and construction time of B3CF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


5.3.1.2 Number of comparisons

The number of comparisons performed when constructing the B3CF-tree (also taken as the sum of the number of comparisons for the resulting clusters) is lower than that of the other index structures (Figure 5.4), regardless of the space in which they are constructed. This is due to the use of the DBSCAN algorithm for clustering, which divides the dataset into clusters of similar (or nearest) objects. For the BCCF-tree, the number of comparisons increases because the whole dataset is used to build the tree, while the BB-tree scored the greatest number of comparisons despite its low number of calculated distances. This is directly related to its construction method, which places the objects to the left or to the right of an axis determined from the median calculations.


5.3.1.3 Construction time

In Figure 5.4, the construction time of the B3CF-tree (considered as the average time for indexing the resulting cluster data) is less than that of the BCCF-tree and the IWC-tree and close to that of the BB-tree and the MX-tree. For example, the ratio of the build time of the B3CF-tree to that of the BCCF-tree is 0.02% for the geographic coordinate data, 0.8% for the GPS trajectory data, 1.8% for the tracking dataset, 0.47% for the WARD database, and 0.012% for the Smart Home data. The longer construction time of the BCCF-tree may be due to its larger number of distances, likely related to the use of k-means for pivot determination. Since the IWC-tree does not use containers, the distances between the objects of the whole dataset are calculated. Moreover, the parallel construction of indexes from DBSCAN clusters reduces the construction time because the overall dataset is divided among the DBSCAN clusters. The index construction results confirm the performance of the proposed approach compared with its competitors. Indeed, regrouping the dataset into clusters allows the use of parallelism during the indexing process. In addition, for each B3CF-tree, choosing the pivots as the farthest objects when partitioning the data in containers is simpler and more efficient than the k-means method used in the BCCF-tree. The DBSCAN clustering performed before indexing, the parallel indexing and the simple choice of pivots in the containers all contributed to reducing the construction cost.

Table 5.4: Values of the number of calculated distances, the number of comparisons and the construction time.

Number of distances
Dataset | B3CF-tree | BCCF-tree | BB-tree | MX-tree | IWC-tree
Geographical coordinates | 4.93E+03 | 1.64E+04 | 2.00E+01 | 6.86E+03 | 1.68E+04
GPS Trajectory | 2.39E+06 | 4.3E+06 | 2.60E+01 | 1.46E+06 | 1.63E+08
Tracking Database | 3.96E+05 | 1.64E+06 | 1.80E+01 | 1.35E+06 | 1.94E+07
WARD | 1.40E+07 | 6.47E+07 | 2.60E+01 | 1.91E+07 | 5.13E+07
Smart Home data | 1.30E+09 | 6.19E+09 | 5.38E+02 | 1.92E+09 | -

Number of comparisons
Dataset | B3CF-tree | BCCF-tree | BB-tree | MX-tree | IWC-tree
Geographical coordinates | 2.46E+03 | 8.22E+03 | 2.95E+04 | 5.46E+03 | 8.39E+03
GPS Trajectory | 1.19E+06 | 2.46E+06 | 2.39E+06 | 1.43E+06 | 8.16E+07
Tracking Database | 1.98E+05 | 8.20E+05 | 1.30E+07 | 1.24E+06 | 9.70E+06
WARD | 7.00E+06 | 3.23E+07 | 7.57E+08 | 1.76E+07 | 2.56E+07
Smart Home data | 6.51E+08 | 3.10E+09 | 7.01E+09 | 1.91E+09 | -

Construction time (s)
Dataset | B3CF-tree | BCCF-tree | BB-tree | MX-tree | IWC-tree
Geographical coordinates | 4.00E-04 | 1.66 | 1.43 | 3.27 | 2.33E+03
GPS Trajectory | 6.19E-02 | 7.65 | 38.80 | 1.32E+03 | 2.05E+04
Tracking Database | 9.20E-03 | 0.50 | 4.27 | 1.06E+02 | 1.75E+03
WARD | 7.80E-03 | 1.67 | 4.20 | 26.70 | 2.70E+03
Smart Home data | 2.09E-02 | 1.80E+02 | 57.10 | 67.10 | -
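The build-time ratios quoted in the construction-time discussion can be checked with a few lines of Python. This is only a small worked example: the times are the B3CF-tree and BCCF-tree construction times as read from Table 5.4, and the names `build_time` and `ratio_percent` are illustrative, not part of the thesis code.

```python
# Construction times in seconds, as read from Table 5.4: (B3CF-tree, BCCF-tree).
build_time = {
    "Geographical coordinates": (4.00e-04, 1.66),
    "GPS trajectory": (6.19e-02, 7.65),
    "Tracking dataset": (9.20e-03, 0.50),
    "WARD": (7.80e-03, 1.67),
    "Smart Home data": (2.09e-02, 1.80e+02),
}


def ratio_percent(b3cf_time, bccf_time):
    """Build time of the B3CF-tree as a percentage of that of the BCCF-tree."""
    return 100.0 * b3cf_time / bccf_time


ratios = {name: ratio_percent(*times) for name, times in build_time.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.3f}%")
```

Rounded, these values agree with the percentages given in the text (for instance about 0.02% for the geographical coordinates and about 0.47% for WARD).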


5.3.2 Evaluation and comparison of the constructed index quality

To check the quality of the constructed B3CF-tree, the number of nodes per level, the distribution of data in the leaves, the number of internal nodes, the number of leaves, and the tree height were examined and compared with those of the BCCF-tree, BB-tree, MX-tree and IWC-tree. Table 5.5 lists the values of the last three features.


5.3.2.1 Number of nodes per level 


The number of nodes per level varies according to the dataset, as shown in Figure 5.5. It is constant for the GPS trajectory and Smart Home datasets and varies from level to level for the other datasets. The number of nodes per level is plotted for the three clusters produced by the DBSCAN algorithm in the Geographic Coordinates and GPS trajectory datasets. For the WARD and Smart Home datasets, however, the DBSCAN algorithm gave more than three clusters, and therefore only three clusters were chosen to present the results. According to Figure 5.5, the proposed B3CF-tree index structure handles very large datasets efficiently.


5.3.2.2 Data distribution in leaves 


Figure 5.6 shows the data distribution in the leaves. The B3CF-trees in each cluster of the Tracking and WARD datasets are balanced, because the data of both datasets are well distributed between the left and right sides of each tree. For the Geographic Coordinates dataset, only the index of the first cluster is balanced. Our index is efficient because the space is divided into two sub-parts that do not intersect, which ensures that the nodes do not overlap. In addition, applying DBSCAN to the first-level data makes the data in each group similar and close to each other, which helps to balance the tree.

Figure 5.5: Number of nodes per level in the B3CF-tree.


5.3.2.3 Number of internal nodes

The number of internal nodes in the B3CF-tree is lower than that of other index structures (Figure 5.7). The number of internal nodes in the IWC-tree structure is high because it does not use containers that control the partitioning of data.


5.3.2.4 Number of leaf nodes 


The same observations can be made for the number of leaf nodes, given its relationship with the number of internal nodes (Figure 5.7). The number of leaf nodes in the B3CF-tree is lower than in the other structures because the use of the DBSCAN algorithm implies the grouping of the closest objects in the same leaf.

Figure 5.6: Distribution of data in the B3CF-tree.


5.3.2.5 Tree height 


The height of the B3CF-tree varies from one dataset to another. It is close to those of the other structures for the geographic coordinate and GPS trajectory datasets and higher than those of the other structures for the other datasets. The large height of the B3CF-tree reflects the effectiveness of clustering with the DBSCAN algorithm in partitioning the data.

After analyzing and comparing the statistical results of the construction and quality of the B3CF-tree, it can be deduced that the proposed index structure performs well. This is due to the combination of the DBSCAN algorithm for clustering with parallelism during the construction of the cluster indexes, which allows a fast construction of the index without overlapping nodes. Indeed, the use of the DBSCAN algorithm guarantees the creation of clusters without overlapping data. In addition, when building the B3CF-tree, choosing the two most distant objects inside the containers as pivots guarantees the partitioning of the space into two parts, which ensures that the nodes do not overlap and that the index is well balanced. All these criteria allow a fast similarity query search. To test the effectiveness of our B3CF-tree, the results of the kNN search are presented and discussed in the next section.

Figure 5.7: Number of internal nodes, number of leaf nodes and height of B3CF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.

Table 5.5: Values of the number of internal nodes, the number of leaf nodes and the height of the tree.

Number of internal nodes
Dataset | B3CF-tree | BCCF-tree | BB-tree | MX-tree | IWC-tree
Geographical coordinates | [illegible] | 45 | 44 | 44 | 341
GPS Trajectory | 264 | 266 | 233 | 232 | 9052
Tracking Database | 339 | 425 | 384 | 434 | 24077
WARD | 1489 | 2108 | 1489 | 1507 | 348109
Smart Home data | 1667 | 4311 | 2577 | 3626 | -

Number of leaf nodes
Dataset | B3CF-tree | BCCF-tree | BB-tree | MX-tree | IWC-tree
Geographical coordinates | 39 | 46 | 45 | 45 | 288
GPS Trajectory | 267 | 267 | 243 | 233 | 2
Tracking Database | 393 | 426 | 385 | 435 | 10005
WARD | 1741 | 2109 | 1490 | 1508 | 220874
Smart Home data | 1678 | 4312 | 2578 | 3627 | -

Height of the tree
Dataset | B3CF-tree | BCCF-tree | BB-tree | MX-tree | IWC-tree
Geographical coordinates | 14 | 8 | 12 | 7 | 15
GPS Trajectory | 267 | 17 | 22 | 30 | 9052
Tracking Database | 276 | 29 | 25 | 35 | 399
WARD | 263 | 267 | 37 | 129 | 69
Smart Home data | 1678 | 1226 | 44 | 818 | -

5.3.3 Evaluation and comparison of the kNN search


To evaluate the kNN search with k = 5, 10, 15, 20, 50 and 100 in the proposed B3CF-tree index structure, the number of distances, the number of comparisons, the search time and the number of visited leaves were measured over 100 queries. To examine the efficiency of the similarity query search, the statistical results obtained were compared with those of the BCCF-tree, MX-tree, BB-tree and IWC-tree indexing structures. Note that all statistical results were averaged over 100 randomly generated queries.


Figure 5.8: Number of calculated distances for the kNN search in B3CF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


5.3.3.1 Number of calculated distances 


Figure 5.8 shows the number of calculated distances for a number of neighbors k between 5 and 100. As can be seen, the proposed B3CF-tree has the smallest number of calculated distances compared with the BCCF-tree, the BB-tree, the MX-tree and the IWC-tree. We can also see that the number of distances calculated in the B3CF-tree (Figure 5.8) is nearly invariant with respect to the number of neighbors k between 50 and 100. For all datasets used in this evaluation, the ratio between the number of distances for k = 50 and k = 100 varies between 1.00% and 2.00%. This result reflects the efficiency of the parallel search in our proposed structure. Although the number of calculated distances is high in the BCCF-tree, for some datasets it is not affected by the increase in the number of neighbors k, unlike in the BB-tree, the MX-tree and the IWC-tree. Table 5.6 summarizes the values of the number of calculated distances.

Figure 5.9: Number of comparisons calculated for the kNN search in B3CF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


5.3.3.2 Number of calculated comparisons

As can be seen in Figure 5.9, for all the datasets used, the lowest number of comparisons corresponds to the proposed B3CF-tree, except for the GPS trajectory data, where the IWC-tree has the lowest number of comparisons, which is almost constant regardless of the value of k. However, for the same dataset, when the variation of the number of comparisons in the B3CF-tree as a function of the parameter k is compared with those of the other index structures, one can observe that it follows a saturation law, which indicates that the number of comparisons stabilizes for values of k greater than 100. This is not the case for the BB-tree, for example, where the number of comparisons grows exponentially. Even though the number of comparisons in the MX-tree is lower than in our B3CF-tree, it increases about 10 times when k = 100. Table 5.7 lists the values of the number of comparisons calculated.

Figure 5.10: Time of kNN search in B3CF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.

5.3.3.3 Time of search


According to Figure 5.10, the proposed B3CF-tree has the lowest search time compared with the BCCF-tree, the BB-tree, the MX-tree and the IWC-tree structures. The ratio of the search time of the B3CF-tree to that of the BCCF-tree is 0.07% for the geographical coordinates data, 0.55% for the GPS trajectory data, 0.06% for the tracking dataset, 0.13% for the WARD database and 0.55% for the Smart Home data. We observe that the value of k has little influence on the performance of the search algorithm: in the B3CF-tree, the ratio of the search time for k = 50 and k = 100 is 1.56% for the geographical coordinates data, 1.74% for the GPS trajectory data, 1.82% for the tracking dataset, 2.07% for the WARD database and 1.07% for the Smart Home data. Our proposed index exhibits the shortest search time not only in comparison with the chosen structures but also in comparison with other indexes. For example, according to Zhang et al. [185], the combination of the R-tree and KD-tree (EEMINC) answered the point query over 1 thousand nodes and 10 million records in 40 to 50 ms and, according to Hu et al. [210], the execution of the 8-nearest-neighbours query on the hierarchical index method with 5.5 billion points takes 8.49 s. For the Smart Home data of 5 million vectors (Figure 5.10), the B3CF-tree answers a query, averaged over 100 queries, in a time between 0.0016 and 0.006 s. This indicates that the use of clustering coupled with parallelism significantly improves the efficiency of the kNN search by decreasing the search time. The search time values are grouped in Table 5.8.

Figure 5.11: Number of the visited leaves in B3CF-tree, BCCF-tree, BB-tree and MX-tree.

5.3.3.4 Number of visited leaves

Figure 5.11 shows the number of leaves visited during the kNN search in the B3CF-tree, BCCF-tree, BB-tree and MX-tree. Note that the number of visited leaves is invariant with respect to the number of neighbors k between 5 and 100. The IWC-tree is not considered because this structure does not support the kNN search [5]. Indeed, this structure presents poor results according to Figures 5.8, 5.9 and 5.10. As can be seen in Figure 5.11, the B3CF-tree exhibits the lowest number of visited leaves, which is why the search time in our index is low. This is due to the use of the DBSCAN algorithm for clustering, which produces non-overlapping clusters. The numbers of visited leaves are listed in Table 5.9.

Table 5.9: Average number of the visited leaves in B3CF-tree, BCCF-tree, BB-tree and MX-tree.
Index | Geographical Coordinates | GPS Trajectory | Tracking Database | WARD | Smart Home data
B3CF-tree | 6.025 | 89 | 5.465 | 4.36918 | 120.36
BCCF-tree | 11.65 | 267 | 163.84 | 1513.39 | 359.04
BB-tree | 45 | 234 | 385 | 1000000 | 2578
MX-tree | 28.74 | 160 | 435 | 1320.32 | 2400.25


5.4 Conclusion 


This chapter presented a new indexing structure called the B3CF-tree (Binary tree based on Containers at the Cloud-Clusters Fog computing level). First, the indexing process is moved from the cloud to the fog nodes in order to bring the data close to the indexing structure and thus significantly reduce network traffic congestion. Moreover, each fog node creates its own indexing structure, allowing parallelism not only in the tree construction but also in the search process, by launching the same query simultaneously on all fog nodes. Second, a pre-indexing process is performed in each fog node: it partitions the data into groups of similar objects using the DBSCAN algorithm. The aim of this process is to generate a balanced tree with a reduced degree of overlap between the leaves of the tree. Indeed, DBSCAN is a density-based clustering method that is distinguished by its ability to automatically create clusters with almost zero inter-class similarity.

6 CV Method for Indexing Continuous IoT Data


6.1 Introduction 


In the previous chapter, we presented the B3CF-tree, which was tested on a single data stream. IoT data are continuously generated by devices in multiple types, such as textual, numerical, streaming and multimedia data [199]. Storing these continuous streams of IoT data and finding an efficient retrieval method is a big challenge given the dynamicity of the data and the diversity of their types and dimensions.


In this chapter, in order to index continuous streams of IoT data and retrieve them efficiently, we propose an effective approach, at the fog-cloud computing level, to organize and store the continuous IoT data stream and speed up the similarity query search. Because it is collected from different devices in the terminal layer, IoT data is characterized by heterogeneity, noise, diversity and rapid growth [85]. To organize each IoT data stream, the fog layer is divided into three levels: the clustering fog level, the clusters processing fog level and the indexing fog level. In the clustering fog level, DBSCAN is used for clustering because it is the most suitable algorithm for grouping diverse IoT data into homogeneous, high-density clusters. Each cluster of the first data stream is stored in the clusters processing fog level and directly indexed in a BH-tree (Binary tree with Hyper-plane) in the indexing fog level. For the arrival data streams, after DBSCAN clustering, the indexing is based on comparing the coefficient of variation (CV) of the arrival cluster with those of the unions of the arrival cluster with each of the existing clusters in the clusters processing fog level. According to the minimum value of CV, the arrival cluster is either directly indexed in a new BH-tree or inserted in an existing index.


The proposed approach is detailed in what follows. The simulation and results are presented and discussed by comparison with two other scenarios. The first scenario is called Creation of a New Index (CNI method) and the second is called Insertion in an Existing Index (IEI method). The comparison is made in terms of the number of calculated distances, the number of calculated comparisons and the construction time during the tree construction process. The same parameters were tested and compared with the two other scenarios for the kNN query search method. The energy consumed during the parallel kNN search is also presented and discussed.


6.2 Proposed Approach 


For the storage and the indexing of the continuous IoT data stream, we benefit from the cloud-fog computing architecture (Figure 8.1). In the terminal layer, geographically distributed IoT devices continuously generate large and diverse data. This continuous data stream is indexed in the fog-computing layer because of its numerous advantages, such as the reduction of service latency, the support of real-time applications and the capacity to handle a high number of nodes [53]. In this work, the fog layer is divided into three levels: the clustering fog level, the clusters processing fog level and the indexing fog level (Figure 6.1). In the clustering fog level, each data stream from the terminal layer is grouped into homogeneous clusters. Clusters of the first data stream are stored in the clusters processing fog level and their objects are directly indexed in separate BH-trees in the indexing fog level. For the arrival data streams, according to the Coefficient of Variation (CV) value of their clusters, in the clusters processing fog level, either a new BH-tree is constructed or the objects of the arrival cluster are inserted in an existing BH-tree.


The processing capabilities of the fog nodes are not affected by the additional work introduced in each layer, since the number of sensors installed determines the type of hardware needed to capture, process and transmit their data. In other words, a large number of sensors implies additional processing power in the fog (this condition is ensured during installation). In addition, the three-level fog architecture, with each level specialised, allows smoother processing.

Figure 6.1: Architecture of the CV method for indexing continuous IoT data.

In what follows, the clustering, the CV and the indexing methods are described in detail. The definitions of the parameters used are listed in Table 6.1.


6.2.1 Clustering method 


In the clustering fog level, each data stream sent by the terminal layer is collected and grouped into N clusters $Cl_n$, with n = 1..N, using the DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) [276], modified to also calculate the cluster centers, denoted $c_n$ in this work, which are needed for the calculation of the coefficient of variation (CV). Each cluster $Cl_n$ contains similar elements.


The triggering of the clustering process is closely linked to the storage capacity of the fog node, since the fog nodes do not all have the same storage and processing capacities. This condition avoids congestion and bottlenecks and allows the processing to be tailored to the capabilities of the fog node.

The DBSCAN algorithm is one of the most widely used data clustering methods [277]. It connects points that lie within a specific distance threshold of each other, but only those points that satisfy a density threshold (a minimum number of objects within a radius). The DBSCAN algorithm partitions the data into clusters of arbitrary shapes. Each cluster contains all the objects that are density-connected. This clustering method was chosen because DBSCAN clusters are formed automatically, whereas the k-means algorithm, for example, requires the number of clusters to be set before clustering. Also, the DBSCAN algorithm is robust in the detection of outliers, which are considered here as objects that wait for other similar objects in the next data stream. The complexity of the DBSCAN algorithm for grouping a dataset of $o$ objects into N clusters is $O(o \cdot d)$ [278], where $o = o_{c_1} + o_{c_2} + \dots + o_{c_N}$, which can be written as $o = N \cdot mean(o_c)$, where $o_c$ is the number of objects per cluster and $d$ is the average number of neighbours. This gives the final form of the complexity of the DBSCAN algorithm for each data stream: $O(N \cdot mean(o_c) \cdot d)$.
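The clustering step described above can be illustrated with a minimal pure-Python sketch of DBSCAN extended with the computation of cluster centers. It is not the thesis implementation (Algorithm 1 is not reproduced here): the function names, the use of the Euclidean distance via `math.dist`, and the choice of the arithmetic mean as the cluster center are assumptions made for illustration. The neighbourhood search is a naive linear scan, so the cost of this sketch is quadratic in the number of objects.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]


def dbscan_with_centers(points, eps, min_pts):
    """Return (labels, centers): label -1 marks noise, centers[c] is the mean of cluster c."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may later be re-labelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # border point reached from a core point
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:   # j is a core point: keep expanding
                queue.extend(j_neighbours)
    centers = []
    for c in range(cluster + 1):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
    return labels, centers


# Two dense groups and one isolated object (hypothetical toy data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels, centers = dbscan_with_centers(points, eps=1.5, min_pts=3)
print(labels)
print(centers)
```

In this toy run the isolated object is labelled as noise, which corresponds to the outliers that, in the approach described above, wait for similar objects in the next data stream.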


6.2.1.1 CV method 


In the clusters processing fog level, the coefficient of variation (CV) is used as a criterion to decide whether a cluster of the arrival data stream is to be inserted in an existing BH-tree or indexed in a new BH-tree. The coefficient of variation is a statistical measure of the dispersion of the data points of a dataset around the mean: it is the ratio of the standard deviation to the mean, $CV = \sigma / \mu$. The advantage of the coefficient of variation is that it is not sensitive to the data type and dimension [279]. The clusters processing fog level contains the clusters of the first data stream $Cl_n$. In this fog level (Figure 6.2), each cluster of the arrival data stream $Cl'_k$ is united with a copy of each of the existing clusters $Cl_n$ (Algorithm 5).

Table 6.1: Table of notations.
Abbreviation | Explanation
N | Number of the first clusters
K | Number of the arrival clusters
$Cl_n$, {n = 1..N} | Clusters of the first data stream
$Cl'_k$, {k = 1..K} | Clusters of the arrival data stream
$c_n$, {n = 1..N} | Cluster centers of the first data stream
$c'_k$, {k = 1..K} | Cluster centers of the arrival data stream
$Cl'_k \cup Cl_n$ | Union of the arrival clusters $Cl'_k$ and the first clusters $Cl_n$
$d(c_n, c'_k)$ | Distance between two centers
$I_n$, {n = 1..N} | Set of indexes
$Min_d$ | Minimum distances between the centers of the existing clusters and the incoming clusters
$p_1, p_2$ | Pivots
E | Set of elements
LN | Leaf node
IN | Inner node
O | Object
L | Left sub-tree
R | Right sub-tree
q | Query
$r_q$ | Radius for recovering the k objects closest to q
A | List in which the set of k objects is stored
$B(q, r_q)$ | Query ball q with radius $r_q$


After that, the CV of the cluster of the arrival data stream, $CV_{Cl'_k}$, and the CV of the union of this cluster with every existing cluster, $CV_{Cl'_k \cup Cl_n}$, are determined. If the cluster of the arrival data stream $Cl'_k$ has the minimum value of CV, a new BH-tree is constructed in the indexing fog level, and the cluster $Cl'_k$ is stored with the existing clusters $Cl_n$ in the clusters processing fog level. If the minimum value of CV corresponds to the union of the cluster of the arrival data stream with an existing cluster, $Cl'_k \cup Cl_n$, the objects of the arrival cluster $Cl'_k$ are inserted in the BH-tree of the corresponding existing cluster $Cl_n$.

Figure 6.2: CV method in the cluster processing level.
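The decision rule of the CV method (new index versus insertion in the index whose union gives the smallest CV) can be sketched as follows. This is a hedged illustration, not the thesis code. This section does not spell out how the CV of a set of multidimensional objects is evaluated; the thesis computes cluster centers for that purpose, whereas this sketch substitutes, as an explicit assumption, the standard deviation divided by the mean of the pairwise distances, which only requires a metric. The names `coefficient_of_variation` and `cv_decision` are hypothetical.

```python
import math
import statistics
from itertools import combinations


def coefficient_of_variation(objects, dist=math.dist):
    """Std / mean of the pairwise distances in `objects` (assumed definition, see lead-in)."""
    distances = [dist(a, b) for a, b in combinations(objects, 2)]
    if not distances:
        return 0.0
    mean = statistics.fmean(distances)
    return statistics.pstdev(distances) / mean if mean > 0 else 0.0


def cv_decision(arrival, existing, dist=math.dist):
    """Return ('new', None), or ('insert', n) where n is the position of the chosen cluster."""
    cv_arrival = coefficient_of_variation(arrival, dist)
    cv_unions = [coefficient_of_variation(arrival + cl, dist) for cl in existing]
    # The arrival cluster alone has the minimum CV: build a new BH-tree for it.
    if not cv_unions or cv_arrival <= min(cv_unions):
        return ("new", None)
    # Otherwise insert its objects in the index of the union with the minimum CV.
    best = min(range(len(existing)), key=cv_unions.__getitem__)
    return ("insert", best)


# Hypothetical one-dimensional toy clusters.
existing = [
    [(100.0,), (101.0,), (102.0,)],
    [(3.0,), (5.0,), (7.0,)],
]
sparse_arrival = [(0.0,), (1.0,), (10.0,)]     # spread around the second cluster
compact_arrival = [(50.0,), (51.0,), (52.0,)]  # compact and far from both clusters
print(cv_decision(sparse_arrival, existing))
print(cv_decision(compact_arrival, existing))
```

Under this assumed definition, a compact arrival cluster far from all existing clusters keeps the lowest CV on its own and receives a new index, while a sparse arrival cluster whose union with an existing cluster is more homogeneous is inserted in that cluster's index.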


Because the CV calculations for the unions of one arrival cluster with the first clusters run in parallel, the complexity for all clusters is taken as the complexity of the CV calculation for the cluster with the maximum number of objects $o_{c_{max}}$, which is approximately $2 \cdot mean(o_c)$, and is given by $O(mean(o_c))$. Since the comparison of the N arrival clusters with the existing clusters is sequential, the complexity of the CV method for each data stream is $O(N \cdot mean(o_c))$. The CV method processes the clusters and not the data themselves, which considerably reduces the processing time, despite a polynomial complexity, because the number of clusters is negligible compared with the number of data objects. This is due to the ability of the DBSCAN method to detect all the clusters, even when they have a non-convex shape. Indeed, only the true clusters are taken into consideration by the method; the other objects are treated as noise.


Algorithm 5 CV method

Require: $Cl = \{Cl_1..Cl_N\}$, n = 1..N; $Cl' = \{Cl'_1..Cl'_K\}$, k = 1..K
Ensure: $I_n$
for each data stream do
    for cl' ∈ Cl' do
        CV_{cl'} ← Calculate the coefficient of variation of the new cluster (cl')
        for cl ∈ Cl do
            CV_{cl' ∪ cl} ← Calculate the coefficient of variation of (cl' ∪ cl)
            if CV_{cl'} < CV_{cl' ∪ cl} then
                create new index (cl')
            else
                insert cl' in I_n
            end if
        end for
    end for
end for

6.2.1.2 Indexing method

In the indexing fog level, the Binary tree with Hyper-plane (BH-tree) used, similar to the B3CF-tree [280], is based on a recursive division of the space by a hyper-plane into two regions through two pivots $p_1$, $p_2$ chosen as the two farthest elements. In the set E, the elements closer to $p_1$ belong to the first region while those closer to $p_2$ belong to the second region. This avoids the overlapping of regions when answering queries. Firstly, a leaf node LN contains a subset $E_{LN}$ of objects with $E_{LN} \subset E$. Secondly, an inner node IN consists of two elements and two children: $(p_1, p_2, L, R) \in O^2 \times IN^2$. That is:

• $p_1$, $p_2$ are two distinct objects, with $d(p_1, p_2) = d_{max}$, called "pivots". They define the hyper-plane.

• L is a left sub-tree and R is a right sub-tree.

The construction of the BH-tree is incremental, and the insertion is top-down.
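The incremental, top-down construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the thesis implementation: the class names `Leaf` and `Inner`, the leaf `capacity` that triggers a split, and the exhaustive search for the two farthest objects are all hypothetical choices made for the example.

```python
import math
from itertools import combinations


class Leaf:
    def __init__(self, objects=None):
        self.objects = list(objects or [])


class Inner:
    def __init__(self, p1, p2, left, right):
        self.p1, self.p2, self.left, self.right = p1, p2, left, right


def split(objects, dist):
    """Replace an overflowing leaf by an inner node whose pivots are the two farthest objects."""
    p1, p2 = max(combinations(objects, 2), key=lambda pair: dist(*pair))
    if dist(p1, p2) == 0:                  # all objects coincide: no hyper-plane can separate them
        return Leaf(objects)
    left = [o for o in objects if dist(o, p1) <= dist(o, p2)]
    right = [o for o in objects if dist(o, p1) > dist(o, p2)]
    return Inner(p1, p2, Leaf(left), Leaf(right))


def insert(node, obj, capacity=4, dist=math.dist):
    """Insert obj top-down and return the (possibly new) root of the sub-tree."""
    if isinstance(node, Inner):
        # Objects closer to p1 go to the left region, the others to the right region.
        if dist(obj, node.p1) <= dist(obj, node.p2):
            node.left = insert(node.left, obj, capacity, dist)
        else:
            node.right = insert(node.right, obj, capacity, dist)
        return node
    node.objects.append(obj)
    if len(node.objects) > capacity:       # assumed overflow rule
        return split(node.objects, dist)
    return node


def leaf_objects(node):
    """Collect the objects stored in all leaves of the sub-tree."""
    if isinstance(node, Leaf):
        return list(node.objects)
    return leaf_objects(node.left) + leaf_objects(node.right)


# Hypothetical toy data inserted one object at a time.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
root = Leaf()
for obj in data:
    root = insert(root, obj)
print(root.p1, root.p2)
```

Because every object is assigned to exactly one side of the hyper-plane, the two regions of an inner node never share an object, which is the non-overlapping property stated above.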


6.2.1.2.a Parallel kNN similarity queries search 


The parallel kNN method is adopted for the similarity query search in BH-trees because adding an arrival cluster to the first clusters induces a re-computation of the Delaunay graphs in the cloud and in all fogs, which may cause latency in the continuous IoT data indexing process. The search algorithm answers the query q with radius r_q by recovering the k objects closest to q (Algorithm 6). The set of k objects is stored in the list A. To address the queries, we apply the kNN algorithm on the BH-tree, starting from the root and descending to its leaves. The search is performed by calculating the distance between the query point and the two pivots p1 and p2, going down the tree and determining whether the search should continue in the left branch L or the right branch R. We start the query with a radius r_q = +∞, which then decreases, while traversing each sub-tree, to the distance to the k-th object in the ordered list A. To make the kNN search more efficient, parallelism is also used in this work in the similarity search query process to minimize the retrieval time [278]. Indeed, the complexity of the kNN search in all indexes could be reduced to the complexity of the search in only one index.
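The shrinking-radius search and its parallel use over several indexes can be sketched in Python as follows. This is a hedged illustration rather than the code used in the experiments: trees are assumed to be nested tuples (p1, p2, left, right) with lists as leaves, the pruning test is the usual generalized hyper-plane bound (a side is skipped when half the difference of the pivot distances exceeds r_q), and the names `knn_bh` and `parallel_knn` are hypothetical. A thread pool stands in for the parallel workers, and k is assumed to be at least 1.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def knn_bh(node, q, k, dist=math.dist, heap=None):
    """Collect the k nearest objects to q; `heap` stores (-distance, object)."""
    if heap is None:
        heap = []
    if isinstance(node, list):                         # leaf: test every object
        for obj in node:
            d = dist(q, obj)
            if len(heap) < k:
                heapq.heappush(heap, (-d, obj))
            elif d < -heap[0][0]:                      # closer than the current k-th
                heapq.heapreplace(heap, (-d, obj))
        return heap
    p1, p2, left, right = node
    d1, d2 = dist(q, p1), dist(q, p2)
    # Visit the side of the closer pivot first so that r_q shrinks early.
    for margin, _, child in sorted([(d1 - d2, 0, left), (d2 - d1, 1, right)]):
        r_q = math.inf if len(heap) < k else -heap[0][0]
        if margin / 2 <= r_q:                          # query ball may reach this side
            knn_bh(child, q, k, dist, heap)
    return heap


def parallel_knn(trees, q, k, dist=math.dist):
    """Search every index concurrently, then merge the partial answers."""
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(lambda t: knn_bh(t, q, k, dist), trees))
    merged = [item for heap in partial for item in heap]
    return [(-neg_d, obj) for neg_d, obj in heapq.nlargest(k, merged)]
```

Each index is searched independently, so with one worker per index the elapsed time is bounded by the slowest single-index search plus the merge step.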


To test the efficiency of our proposed approach, the CV method will be compared with two other scenarios. For these scenarios, the fog layer contains only the clustering and the indexing levels. The first scenario is called Creation of a New Index (CNI) and the second scenario is called Insertion in an Existing Index (IEI).


6.2.1.2.b CNI method 


In this scenario, objects in the clusters Cl' of the arrival data stream are indexed in a new BH-tree. The description of this method is presented in Algorithm 7. This method is simple and needs no comparison with the existing clusters or indexes. The CNI method results in the creation of indexes of similar objects.

Algorithm 6 kNN search in the BH-tree

kNN-BH-tree(IN, q, k, d, r_q, A), where:
    IN ∈ N,
    q ∈ R^n,
    k ∈ N*,
    d : O × O → R+,
    r_q ∈ R+ = +∞,
    A ∈ (R+ × O)^N = ∅
with:
    - (p1, p2, L, R) = IN
    - d1 = d(p1, q)
    - d2 = d(p2, q)
    - B(q, r_q): query ball of center q and radius r_q
if IN == NULL then
    return A
else
    Calculate the distances d1 and d2
    if |A| < k then
        r_q ← +∞
    else
        r_q ← distance of the k-th element of A
    end if
    for i ∈ {1, 2} do
        if d_i < r_q then
            A ← k-Insert(k, A, (d_i, p_i))
        end if
        for each node IN_i do
            if B(q, r_q) ∩ IN_i ≠ ∅ then
                A ← kNN-BH-tree(IN_i, q, k, d, r_q, A)
            end if
        end for
    end for
end if

Algorithm 7 CNI method

Require: Cl = {Cl_n, n = 1..N}, Cl' = {Cl'_k, k = 1..K}
Ensure: I_k
for each data stream do
    for cl' ∈ Cl' do
        create new index (cl')
    end for
end for


6.2.1.2.c IEI method 


In this scenario, objects of each cluster of the arrival data stream are inserted in one of the existing indexes. In this method, the cluster centers c_n of the first data stream are taken as representatives of the existing indexes. The choice of the existing BH-tree in which the objects of the arrival cluster Cl' will be inserted is based on the distances between the arrival cluster center c'_k and the existing BH-tree representative centers c_n (Algorithm 8). Objects of the arrival cluster Cl' are inserted in index n when the distance between c'_k and c_n is minimum.
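The nearest-center rule of the IEI scenario can be sketched in Python as follows. This is an illustrative sketch only: the cluster center is assumed to be the arithmetic mean of the cluster objects, and the names `centroid` and `iei_assign` are hypothetical.

```python
import math
from statistics import fmean


def centroid(cluster):
    """Arithmetic mean of the cluster objects (assumed definition of the center)."""
    return tuple(fmean(axis) for axis in zip(*cluster))


def iei_assign(arrival_clusters, first_centers):
    """For each arrival cluster, return the index n of the closest first-stream center."""
    assignment = []
    for cluster in arrival_clusters:
        c_new = centroid(cluster)
        n = min(range(len(first_centers)),
                key=lambda i: math.dist(c_new, first_centers[i]))
        assignment.append(n)
    return assignment
```

Since the list of representative centers never grows, the number of indexes stays fixed whatever the number of data streams.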


Algorithm 8 IEI method

Require: Cl = {Cl_n, n = 1..N}, Cl' = {Cl'_k, k = 1..K}
Ensure: I_n
for each data stream do
    for cl'_k ∈ Cl' do
        for cl_n ∈ Cl do
            Min_d ← calculate distances (d(c_n, c'_k))
        end for
        insert cl'_k in the index I_n for which d(c_n, c'_k) = Min_d
    end for
end for

6.3 Simulation and Results

In this section, we first describe the experimental parameters, including the datasets and the experimental platform. Then, we report and discuss our experimental results with respect to the evolution of the number of indexes with the data stream, the evaluation of the indexes construction and the evaluation of the parallel kNN search.


6.3.1 Experimental setting 


For the experimental evaluation of the proposed indexing methods, we have used three real datasets (GPS trajectory, WARD and traffic datasets) and one synthetic dataset (Tracking). Details on these datasets are presented in what follows.

1. GPS Trajectories: Collected from the Go!Track Android application [271].

2. Tracking dataset: Moving vectors generated by an object tracking simulator with wireless cameras in the wireless multimedia sensor network in a random simulation [5].

3. WARD (Wearable Action Recognition Database) [272]: Database of human action recognition using wearable motion sensors [273].

4. Traffic dataset: Belongs to the road networks category [281].


In order to carry out our data stream simulation experiments, all datasets were divided into subsets. These subsets of different sizes and dimensions (Table 6.2) are considered as data streams. Our experiments were implemented in Python on a 64-bit Linux operating system (Ubuntu) with an Intel(R) Core(TM) i7-8550U CPU (1.80 GHz × 8), 16 GB RAM and 256 GB ROM.

Table 6.2: Characteristics of the selected datasets for the index evaluation.

Dataset | Size (vectors) | Dimension | Size of the data stream (vectors) | Size of the data stream (bytes)
GPS trajectory | 18107 | 3 | 4000 | 115507.02
Tracking (moving object) dataset | 62702 | 20 | 12000 | 1270493.8
WARD | 3078552 | 5 | 600000 | 18058184
Traffic dataset | 5000000 | 2 | 1000000 | 20132659.2

Figure 6.3: Number of BH-trees versus data stream (panels: GPS trajectory, Tracking dataset, WARD and Traffic dataset; legend: CNI, IEI and CV methods).


6.3.2 Evolution of the number of indexes with the data stream

The variation of the number of indexes as a function of the data stream for the datasets used is presented, in figure 6.3, for the CV method and the two other scenarios. Note that for the first data stream, the BH-tree of each cluster was directly constructed. The proposed method is used from the second data stream onwards. As expected, the use of the IEI method results in the construction of a minimum number of indexes that remains invariant with the data stream. In contrast to the IEI method, the use of the CNI method results in the construction of a maximum number of indexes that increases proportionally to the number of data streams. For the CV method, the number of constructed indexes lies between the number of indexes from the IEI method and that from the CNI method. The number of indexes from the CV method is closer to that from the IEI method, for all datasets, which indicates that in the CV method, the insertion process is more pronounced than the construction process. We can see that the number of indexes from the CV method varies from one dataset to another. This depends directly on the dynamic aspect of the DBSCAN clustering, which induces a change of the distances between cluster centers for each data stream.


6.3.3 Evaluation of indexes construction

For the evaluation of the BH-tree construction, the number of distances, the number of comparisons, the time of indexing and the energy consumption are calculated as a function of the data stream.

6.3.3.1 Number of calculated distances

In figure 6.4, the number of distances is plotted as a function of the data stream for the three methods. We can see that the number of distances during the construction of the BH-trees starts varying from the second data stream. From this data stream onwards, the number of distances varies, from one method to another, as a function of the data size. For the four datasets, the CNI method presents the highest number of distances, since the creation of pivots requires more distance calculations, while the IEI method presents a lower number because no pivots are created in the insertion process. Although the CV method combines both the insertion and the indexing processes, its number of distances is close to that of the IEI method, which reflects the efficiency of the CV method.


6.3.3.2 Number of calculated comparisons 


The variation of the number of comparisons, as a function of the data stream, is plotted in figure 6.5. As can be seen, this variation is similar to that of the number of distances in figure 6.4. For the three methods, the number of comparisons is greater than the number of distances. During the construction of new indexes or during the insertion, comparisons are required to choose the left side or the right side of each BH-tree.


6.3.3.3 Time of indexing 


As shown in figure 6.6, the time of indexing depends not only on the data stream, but also on the size of each data stream. For GPS trajectory data, the time of indexing is greatest when the IEI method is used, while for tracking and WARD data, the time of indexing is maximum when the CNI method is used. For the traffic dataset, the time of indexing varies from one method to another, as a function of the data stream. For the four datasets used, the indexing of data streams using the CV method takes acceptable times whatever the size of the data stream. Contrary to the IEI and the CNI methods, the CV method is not sensitive to the size of the data stream.

Figure 6.4: Number of distances calculated


6.3.3.4 Energy consumption during the indexing 


The energy consumption per stream is plotted, in figure 6.7, for the CV, CNI and IEI methods. The energy consumption $E_{prog}$ (in Joules) during the execution of a program prog is given by the following expression [282]:

$$E_{prog} = \int_{t_b}^{t_e} P(prog, t)\,dt - \int_{t_b}^{t_e} P_i(t)\,dt \qquad (6.1)$$

where $t_b$ and $t_e$ are the beginning time and the end time of the execution of the program prog (in seconds), $P(prog, t)$ is the electrical power needed for the execution of the program prog (in Watts) and $P_i(t)$ is the idle power (in Watts). As can be seen in figure 6.7, the energy consumption during the indexing using these three methods varies from one dataset to another. For the GPS trajectory, the energy consumption is highest during the data indexing using the IEI method. The energy consumption of the CV method is slightly lower than that of the CNI method. Contrary to the GPS trajectory dataset, the energy consumption during the indexing of both tracking and WARD datasets using the CNI method is higher compared with the CV and the IEI methods, which are close to each other. For the traffic dataset, the energy consumption during the indexing using the CNI and the IEI methods is comparable and is greater than that of the CV method.

Figure 6.5: Number of comparisons calculated during the indexing of each data stream.

Figure 6.6: Time of data stream indexing using the CV method compared with both the IEI and the CNI methods.

Figure 6.7: Average energy consumption during indexes construction using CV, CNI and IEI methods.
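For illustration, Equation 6.1 can be approximated numerically when the power is only known at sampled instants. The Python sketch below uses the trapezoidal rule on the net power (measured power minus idle power). The function name `energy_joules` and the sample layout are hypothetical, and the way the power was actually measured in the experiments is not described here.

```python
def energy_joules(times, power, idle_power):
    """Trapezoidal estimate of the integral of P(prog, t) - P_i(t) over time.

    times      : sample instants in seconds, increasing, from t_b to t_e
    power      : measured power in Watts at each instant
    idle_power : idle power in Watts at each instant
    """
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        net_prev = power[i - 1] - idle_power[i - 1]
        net_curr = power[i] - idle_power[i]
        total += 0.5 * (net_prev + net_curr) * dt
    return total
```

Subtracting the idle power before integrating gives the same result as subtracting the two integrals, because both integrals share the bounds t_b and t_e.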


6.3.4 Quality of the constructed BH-trees 


For the evaluation and the comparison of the quality of the BH-trees constructed using the CV method with those from the IEI and the CNI methods, the average height of indexes, the average number of internal nodes and the average number of leaf nodes are plotted, for the four datasets, in figure 6.8. The number of nodes per level (Figure 6.9) and the data distribution in leaves (Figure 6.10) are determined after the indexing of all data streams.

Figure 6.8: Average height, average number of internal nodes and average number of leaf nodes of BH-trees constructed using the CV method, the CNI method and the IEI method.
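The quality indicators listed above can be obtained with a single traversal of each tree, as the following Python sketch suggests. It assumes a hypothetical tree layout in which inner nodes are tuples (p1, p2, left, right) and leaves are lists, and it counts the height of a leaf as 0. The function name `tree_stats` is an assumption made for the example.

```python
from collections import Counter


def tree_stats(node, level=0, per_level=None):
    """Return (height, internal nodes, leaf nodes, nodes per level)."""
    if per_level is None:
        per_level = Counter()
    per_level[level] += 1
    if isinstance(node, list):                     # leaf node
        return 0, 0, 1, per_level
    _, _, left, right = node
    h_l, i_l, l_l, _ = tree_stats(left, level + 1, per_level)
    h_r, i_r, l_r, _ = tree_stats(right, level + 1, per_level)
    return 1 + max(h_l, h_r), 1 + i_l + i_r, l_l + l_r, per_level
```

Averaging the first three values over all the trees produced by one method gives the kind of per-method figures that are compared in this section.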


6.3.4.1 Average height of BH-trees 


Figure 6.8 presents the average height of the BH-trees resulting from the indexing of streams of the GPS trajectory, tracking, WARD and traffic datasets. For all datasets, the average height of indexes constructed using the IEI method is greater than that of indexes constructed using the CNI method. This is because, in the IEI method, all data are inserted in a constant number of BH-trees, while in the CNI method a BH-tree is constructed for each cluster. We can also see, in figure 6.8, that the average height of indexes constructed using the CV method is comparable to that of the CNI method for the GPS trajectory and the tracking datasets, while for the WARD and the traffic datasets, the average height from the CV method is greater than that from the CNI method and slightly surpasses the average height from the IEI method for the WARD data. The CV method changes its behavior as a function of the data stream size and dimension. It behaves like the CNI method when indexing the GPS trajectory and the tracking datasets and like the IEI method when indexing the WARD and the traffic datasets.


6.3.4.2 Average number of internal nodes 


The average number of internal nodes per BH-tree varies from one dataset to another, as can be seen in figure 6.8. For the GPS trajectory, tracking and traffic datasets, the average number of internal nodes in indexes constructed using the IEI method is greater than that in indexes constructed using the CNI method, contrary to the WARD dataset, where the average number of internal nodes in BH-trees constructed using the CNI method is greater than that in indexes from the IEI method. For all datasets, the average number of internal nodes constructed using the CV method lies between those of the CNI and IEI methods. As expected, the variation of the average number of leaf nodes, as a function of the indexing method, is similar to that of the average number of internal nodes (Figure 6.8).


6.3.4.3 Number of nodes per level

The number of nodes per level in BH-trees constructed using the CNI, IEI and CV methods is plotted in figure 6.9 for the four datasets used. As can be seen, the number of nodes per level varies from one dataset to another. For the GPS trajectory, the number of nodes is constant in all levels of the BH-tree whatever the indexing method. For the tracking dataset, the variation of the number of internal nodes per level changes as a function of the indexing method. For the IEI method, five levels contain a high number of nodes, while two levels with a maximum number of nodes are obtained from the CV method and only one level with a maximum number of nodes is obtained from indexes built by the CNI method. For the WARD dataset, only one level with a maximum number of nodes is found in indexes constructed using both the CNI and IEI methods. The indexes constructed using the CV method contain two levels with a maximum number of nodes. For the traffic dataset, one level with a maximum number of nodes is obtained from BH-trees constructed using the CNI and the CV methods. For the IEI method, three levels have a high number of nodes. The number of nodes remains constant beyond level 25 for the CV method (2 nodes per level) and beyond level 60 for the IEI method (25 nodes per level).

Figure 6.9: Variation of the number of nodes per level of BH-trees constructed using the CNI, IEI and CV methods.


6.3.4.4 Data distribution in BH-tree leaves 


The distribution of data in the left and the right sides of the BH-tree is plotted, in figure 6.10, for the CNI, IEI and CV methods. For the GPS trajectory dataset, the indexes constructed using both the CNI and IEI methods are not balanced, while the indexes constructed using the CV method are well balanced. For the tracking, WARD and traffic datasets, indexes from the three methods are well balanced. We can also see that the data distribution in indexes constructed using the CNI and IEI methods is similar whatever the dataset used.

Figure 6.10: Data distribution in leaves.

6.3.5 Evaluation of the parallel kNN search in BH-trees

For the evaluation of the parallel kNN search with k = 5, 10, 15, 20, 50 and 100 in BH-trees constructed using the CNI, IEI and CV scenarios, the number of distances, the number of comparisons, the time of search, the energy consumption and the number of visited leaves are determined for 100 queries. Note that all statistical results were averaged over 100 randomly generated queries.

6.3.5.1 Number of calculated distances


The average number of calculated distances during the kNN search with k = 5, 10, 15, 20, 50 and 100 is plotted, in figure 6.11, for the three methods. As can be seen in figure 6.11, the average number of distances varies from one dataset to another. For the GPS trajectory dataset, the average numbers of distances during the query search in BH-trees constructed using the CNI and CV methods are close, and lower than that calculated during the query search in indexes constructed using the IEI method. This could be correlated with the variation of the average height of indexes (Figure 6.8), since the number of nodes per level does not vary for the GPS trajectory data (Figure 6.9). For the tracking and WARD datasets, the average number of distances calculated during the kNN query search in indexes constructed using the CV method is lower than the number of distances in indexes constructed using the CNI and IEI methods.

This can be related to the variation of the number of nodes per level (Figure 6.9). For the tracking dataset, for levels between 5 and 10, the number of nodes from the CNI method is greater than that from the IEI method, and that from the IEI method is greater than that from the CV method. For the WARD dataset, the number of nodes per level from the IEI method is greater than that from the CNI method, which is greater than the number of nodes per level from the CV method. For the traffic dataset, the number of distances calculated during the query search in indexes constructed using the CV method is greater than that in indexes from the CNI method and lower than that in indexes from the IEI method. This could be directly related to the height of the indexes (Figure 6.8).

Figure 6.11: Number of distances calculated for the kNN search in BH-trees by CNI, IEI and CV methods.


6.3.5.2 Number of calculated comparisons 


The average number of comparisons calculated during the kNN query search in BH-trees constructed using the CNI, IEI and CV methods is presented in figure 6.12. A similar variation is observed for the four datasets.

We can see that the average number of comparisons calculated in indexes constructed using the CV method is lower than that in indexes constructed using the CNI and IEI methods. This may be due to the fact that the use of the CV method results in the fusion of clusters of similar objects, in contrast to the IEI method, in which heterogeneous objects are inserted in a constant number of indexes.

Figure 6.12: Number of comparisons calculated for the kNN search in BH-trees by CNI, IEI and CV methods.


6.3.5.3 Time of search 


Figure 6.13 shows the variation of the time of kNN search queries in BH-trees constructed using the CNI, IEI and CV methods. The variation of the time of search for the three scenarios is related to the variation of both the average number of distances (Figure 6.11) and the average number of visited leaves (Figure 6.14).

Figure 6.13: kNN search time of CNI, IEI and CV method.

For the GPS trajectory dataset, the shortest time of search is obtained for indexes constructed using the CNI method, where the time of search varies from 0.0016 to 0.0046 s when k varies from 5 to 100. For this indexing method, the numbers of both the distances and the visited leaves are lower than those of the two other methods. The time of search in indexes constructed using the CV method is nearly invariant as a function of k and always located between those of the CNI and IEI methods. The time of search in indexes built by the CV method is around 0.0048 s. For the tracking and WARD datasets, the CV method presents the shortest time of kNN query search, which varies from 0.008 to 0.02 s for the tracking dataset and from 0.227 to 0.596 s for the WARD dataset when k varies from 5 to 100. For the traffic dataset, although the minimum search time corresponds to the IEI method indexes, the search times for the three methods are close. They vary between about 0.0137 and 0.0338 s when k varies from 5 to 100. The time of search for the traffic dataset is lower than that for the WARD data because the number of visited leaves for the traffic dataset is lower than that for the WARD dataset, as can be seen in figure 6.14. For k = 100 and for tracking data, the time of search in indexes from the CV method is 46% of that from the CNI method and 69% of that from the IEI method, while for the WARD data, it is 53% and 47% of that from the CNI and the IEI methods, respectively.


Figure 6.14: Number of the visited leaves in CNI, IEI and CV method (average number of visited leaves for the GPS trajectory, Tracking, WARD and Traffic datasets).


For the traffic dataset, the time of search in indexes from the IEI method is 96% of that from the CV method and the time of search for the CNI method is 97% of that from the CV method. In the CV method, when the coefficient of variation (CV) of the union of a new cluster from the incoming data stream with one of the existing first clusters is minimum, this means that objects in the two clusters are very similar. Thus, objects in the new cluster are inserted in the index corresponding to the first cluster, which makes objects in this index more similar. That is why, during the kNN query search in the CV method indexes, the number of distances and the number of visited leaves are lower compared with the other methods. For the GPS trajectory dataset, the height of indexes directly influences the search time because indexes constructed using the CV, CNI and IEI methods have a constant number of nodes per level (Figure 6.9). In ref. [5], Benrazek et al. indexed all of the above-cited datasets in a BCCF-tree. For k = 100, they found 0.16191, 0.21034 and 2.72482 s for the GPS trajectory, tracking and WARD datasets, respectively. For k = 100, the time of search obtained by Zhang et al. [283], which is 0.682 s for a dataset of 1 million objects, is comparable to ours for the WARD dataset of 3 million objects. The improvement of the time of search using our proposed methods comes from the use of the DBSCAN clustering algorithm, which results in the creation of clusters of high similarity.


6.3.5.4 Energy consumption during the kNN search 


Figure 6.15 presents the energy consumption during the parallel kNN search with k = 100 in indexes constructed using the CV, CNI and IEI methods. For the four selected datasets, the CNI method consumes more energy than the CV and IEI methods. This was expected, since the use of the CNI method induces the creation of more indexes. In addition, the use of parallelism induces an energy consumption in all indexes. The energy consumption during the 100NN search in indexes constructed using the CV method is comparable to that of the IEI method, which reflects its efficiency during the indexing of continuous data streams.

Figure 6.15: Energy consumption during the 100NN search for CNI, IEI and CV method.


6.4 Conclusion 


According to the comparison of results, we conclude that, although the kNN search time in indexes constructed using the CNI and the IEI methods is comparable with that of the above-cited methods, some of their characteristics are not desirable for continuous IoT data stream indexing. Although the CNI method is capable of dynamically indexing the continuous IoT data stream, it produces a very large number of indexes. The construction of these indexes is a high-cost process in terms of the number of distances, the number of comparisons, the computing time and the energy consumption. The large number of indexes also increases the number of distances, the number of comparisons, the time and the energy consumption during the kNN query search. In addition, creating a large number of indexes leads to the risk of memory overload, which negatively affects the indexing process of the continuous IoT data. The IEI method allows indexing the continuous IoT data stream with a low computational cost in terms of the number of distances, the number of comparisons and the time. However, it is not adapted to continuous IoT data streams. The inclusion of a large number of varied elements, from the continuous incoming data, in a limited number of indexes increases the height of these indexes and reduces their similarity, since the criterion of insertion is the minimum distance between the existing and the incoming cluster centers. Increasing the depth of indexes and reducing their similarity may lead to an increase of the number of distances, the number of comparisons and the time of kNN search computations. In addition, inserting a large number of elements exposes the indexes constructed using the IEI method to the problem of degradation. Compared with the CNI and the IEI methods, the CV method is more capable of dynamically indexing the continuous IoT data stream. The coefficient of variation (CV) determines whether the cluster resulting from the union of the existing and the incoming clusters is similar or dissimilar. If the union of clusters is similar, incoming elements are inserted in the corresponding existing cluster index and, if the union of clusters is dissimilar, a new index is constructed from the incoming cluster. Having similar elements in each index reduces the number of distances, the number of comparisons, the energy consumption and the time of the kNN query search. In addition, creating new indexes, in the case where the union of clusters is not similar, makes this method an appropriate approach for indexing continuous IoT data streams. By using the CV method, we avoid the problem of the infinite number of indexes and the degradation of the indexes, and we guarantee the construction of new indexes with low energy consumption and without data overlap.

7 TD Method for Indexing Continuous IoT Data


7.1 Introduction 


In the previous chapter, we presented a new approach to index the continuous IoT data stream using the Coefficient of Variation (CV) method. We propose, in this chapter, a new method to index continuous IoT data flows that takes advantage of the fog-cloud architecture. The proposed method, called Threshold Distance (TD), is a two-step process. The first step consists of grouping the arrival data flow into homogeneous clusters by means of the DBSCAN algorithm [119]. The second step is the construction of a GHT (Generalized Hyperplane Tree) [220], [262] for each cluster. The clusters of the first data flow are directly indexed, while those of the next data flows are indexed or inserted in an existing GHT after comparing the distances between their centers and the existing cluster centers to a threshold distance value. To test the efficiency of this proposed method, the experimental results will be compared with those of our second proposed method in this chapter, called Creation of a New Tree (CNT), in which a new GHT is constructed for each arrival cluster.

The present chapter starts with a detailed description of the proposed approach, followed by the presentation of the experimental results of the GHTs construction and those of the kNN query search. The experimental results of both the GHTs construction and the kNN query search will be discussed and compared with those of the CNT index. The experimental results of the kNN search will also be compared with those of the CV method presented in the previous chapter and, finally, with those of the B3CF-tree method presented earlier in this thesis.


Figure 7.1: Architecture of TD method (cloud layer, fog layer and terminal layer).


7.2 Proposed Approach 


In an IoT environment, data is continuously sent from multiple devices to data warehouses in the cloud. In this approach, before sending data to the cloud, we first process it in the fog layer. Because our proposed approach is a two-step process, the fog layer is divided into two levels: the clustering fog level and the indexing fog level (Figure 7.1).

The first process, which takes place in the clustering fog level, consists of grouping each data flow Fl of center C_Fl into n clusters using the DBSCAN algorithm [119]. This results in clusters of high homogeneity and high density with centers C_n (Algorithm 9). In the second step, in the indexing fog level, a GHT (Generalized Hyperplane Tree) [220], [262] is constructed for each cluster Cl of the first data flow. For the next data flows Fl' of center C'_Fl, each resulting cluster Cl' of center C'_n' will be indexed or inserted in an existing GHT after the comparison of the minimum distance between C'_n' and the centers of the first clusters, d_min(C_n, C'_n'), to the threshold distance value TD. The threshold distance TD is determined as the average distance between the next data flow center C'_Fl and the centers of the first data flow clusters C_n. If d_min(C_n, C'_n') > TD, a new GHT is constructed and the cluster center C'_n' is added to the first data flow cluster centers C_n. If d_min(C_n, C'_n') ≤ TD, the objects of the cluster Cl' are inserted in the GHT which corresponds to the cluster center C_n.
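The threshold test described above can be sketched in Python as follows. This is a minimal sketch under assumptions, not the implementation used in the experiments: centers are taken as arithmetic means, the center of the next data flow is computed from its clustered objects only, TD is computed once per flow before any new center is added, and the names `centroid` and `td_step` are hypothetical.

```python
import math
from statistics import fmean


def centroid(points):
    """Arithmetic mean of a set of objects (assumed definition of a center)."""
    return tuple(fmean(axis) for axis in zip(*points))


def td_step(first_centers, next_clusters):
    """Return TD and, per next cluster, ("new", n) or ("insert", n)."""
    flow_center = centroid([p for cl in next_clusters for p in cl])
    # TD: average distance between the flow center and the known centers.
    td = fmean(math.dist(flow_center, c) for c in first_centers)
    centers = list(first_centers)
    decisions = []
    for cluster in next_clusters:
        c_new = centroid(cluster)
        n = min(range(len(centers)), key=lambda i: math.dist(c_new, centers[i]))
        if math.dist(c_new, centers[n]) > td:
            centers.append(c_new)                  # a new GHT is built for this cluster
            decisions.append(("new", len(centers) - 1))
        else:
            decisions.append(("insert", n))        # objects go into the existing GHT n
    return td, decisions
```

In this sketch a center added for a new GHT takes part in the comparisons made for the following clusters of the same flow, which mirrors the statement that the new center joins the set of first cluster centers.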


195 


CHAPTER. 7 TD Method for Indexing Continuous IoT Data 


Algorithm 9 TD method


Require: Fl //Data flow
Ensure: GHTs
// Clustering of the first data flow
if Cl_n = ∅ then
    (n, Cl_n) ← DBSCAN(Fl)
    //n: number of clusters. Cl_n: set of clusters cl
    for cl ∈ Cl_n do
        C_n ← Calculate center(cl)
        //C_n: set of cluster centers c_n
    end for
    // Indexing of the first clusters Cl_n
    for cl ∈ Cl_n do
        GHT ← Build(cl)
    end for
else if Cl_n ≠ ∅ then
    Fl' ← Fl
    //Fl': next data flow
    //Calculation of the center of the next data flow
    C_Fl' ← Calculate center(Fl')
    //Clustering of the next data flow
    (n', Cl'_n) ← DBSCAN(Fl')
    //n': number of next clusters. Cl'_n: set of next clusters cl'
    for cl' ∈ Cl'_n do
        C'_n ← Calculate center(cl')
        //C'_n: set of next cluster centers c'_n
    end for
    //Calculation of the threshold distance TD
    TD ← mean(distance(C_n, C_Fl'))
    for each cl' ∈ Cl'_n do
        for each cl ∈ Cl_n do
            if dmin(C_n, c'_n) > TD then
                GHT ← Build(cl')
                Add cl' to Cl_n
            else
                Insert objects of cl' in GHT(c_n)
            end if
        end for
    end for
end if
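
The decision step of the TD method can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes Euclidean distance, takes cluster centers as plain coordinate tuples, and follows the reading of the text in which one decision is taken per next cluster, using its minimum distance to the known centers. The function names `threshold_distance` and `route_next_clusters` are hypothetical.

```python
import math


def threshold_distance(first_centers, next_flow_center):
    """TD = mean distance from the next flow's center to the first-flow cluster centers."""
    dists = [math.dist(c, next_flow_center) for c in first_centers]
    return sum(dists) / len(dists)


def route_next_clusters(first_centers, next_flow_center, next_centers):
    """Decide, per next-flow cluster, between building a new GHT and inserting.

    Returns (decisions, centers). Each decision is ("build", i) or ("insert", i),
    where i is the position in `centers` of the GHT that receives the cluster.
    """
    centers = list(first_centers)
    td = threshold_distance(centers, next_flow_center)  # computed once per flow
    decisions = []
    for c_new in next_centers:
        dists = [math.dist(c, c_new) for c in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        if dists[nearest] > td:
            # Too far from every known center: a new GHT is built
            # and the new center joins the set of known centers.
            centers.append(c_new)
            decisions.append(("build", len(centers) - 1))
        else:
            # Close enough: the objects go into the GHT of the nearest center.
            decisions.append(("insert", nearest))
    return decisions, centers
```

For example, with first centers (0, 0) and (10, 0) and a next flow centered at (5, 0), the threshold is 5; a next cluster centered at (1, 0) is inserted into the first GHT, while one centered at (5, 20) triggers the construction of a new GHT.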
To check the efficiency of the proposed TD method, the experimental results in the next section are compared with those of another proposed method derived from the TD method. We call this method Creation of a New Tree (CNT); it is developed in the same fog-cloud architecture. In this method, as its name indicates, a new GHT is constructed for every arriving cluster. The CNT method is described in Algorithm 10.


Algorithm 10 CNT method


Require: Fl //Data flow
Ensure: GHTs
// Clustering of the first data flow
(n, Cl_n) ← DBSCAN(Fl)
//n: number of clusters. Cl_n: set of clusters cl
for cl ∈ Cl_n do
    GHT ← Build(cl)
end for
Fl' ← Fl
//Fl': next data flow
(n', Cl'_n) ← DBSCAN(Fl')
//n': number of next clusters. Cl'_n: set of next clusters cl'
for cl' ∈ Cl'_n do
    GHT ← Build(cl')
end for


7.3 Simulation and Results


Experiments were implemented in the Python programming language on an Intel(R) Core(TM) i7-8550U, 1.80 GHz × 8 processor with 16 GB of RAM and a 256 GB SSD, under a 64-bit Linux operating system (Ubuntu). To evaluate our proposed TD method by comparison with our CNT method, two real datasets were used. The tracking dataset contains 6.27k vectors of dimension 20 [5] and the smart home data contains 10M vectors of dimension 4 [275]. To simulate data flows, both datasets were divided into subsets considered as flows. The characteristics of these data flows are summarized in Table 7.1.


Table 7.1: Characteristics of the used data flows.


Dataset          | Dimension | Number of flows | Vectors per flow
Tracking dataset | 20        | 6               | 1.04k
Smart home       | 4         | 5               | 2M


For the experimental results, the evolution of the number of GHTs, the evaluation of GHT construction and the evaluation of the parallel kNN search in these indexes are presented and analyzed, as a function of the data flow, for the proposed TD method. The results from our CNT method are used for comparison with the TD method.


7.3.1 Evolution of the number of GHT 


Figure 7.2 presents the evolution of the number of constructed GHTs as a function of the data flow. The number of GHTs is mainly constant when using the TD method and increases proportionally to the data flow when using the CNT method. The TD method results in the construction of 17 GHTs for the tracking dataset and 7 GHTs for the smart home data, while the CNT method constructs 84 GHTs for the tracking dataset and 35 GHTs for the smart home data. This indicates that the TD method considerably reduces the number of constructed indexes when compared with the CNT method. In the CNT method, new GHTs are produced with each new data flow, so their number is constantly increasing; in the TD method, the number of indexes remains stable despite the increase in the number of data flows.

Figure 7.2: Evolution of the number of GHT as a function of the data flow.


7.3.2 Evaluation of GHT construction 


For the evaluation of GHT construction, the computed distances, the computed comparisons and the construction cost will be presented as a function of the data flow.


7.3.2.1 Computed distances 


From the second flow onward, and for both datasets, the number of distances computed by the CNT method is much higher than that of the proposed TD method (Figure 7.3). This is because, during index construction, many distances are calculated to determine the pivots and to assign objects to the left or the right side. This is not the case for object insertion, since no pivots are calculated.

Figure 7.3: Computed distances of the TD method and the CNT method as a function of the data flow.
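
The cost of inserting an object into an existing GHT can be illustrated with the following Python sketch. It is only an illustration under assumptions: the thesis does not detail the pivot selection strategy, so the sketch uses the classic scheme in which the first two objects reaching a node become its pivots, and it assumes Euclidean distance. `GHTNode` and `ght_insert` are hypothetical names. The point it shows is that an insertion computes two distances per traversed level and performs no pivot search.

```python
import math


class GHTNode:
    """A GHT node: two pivots and the subtrees on each side of their hyperplane."""

    def __init__(self):
        self.p1 = None
        self.p2 = None
        self.left = None
        self.right = None


def ght_insert(node, obj, dist=math.dist):
    """Insert obj below node and return the number of distances computed."""
    if node.p1 is None:
        node.p1 = obj  # first object reaching the node becomes pivot 1
        return 0
    if node.p2 is None:
        node.p2 = obj  # second object becomes pivot 2
        return 0
    # Two distance computations decide the side of the hyperplane.
    if dist(obj, node.p1) <= dist(obj, node.p2):
        if node.left is None:
            node.left = GHTNode()
        return 2 + ght_insert(node.left, obj, dist)
    if node.right is None:
        node.right = GHTNode()
    return 2 + ght_insert(node.right, obj, dist)
```

Summing the returned counts over the objects of a cluster gives the number of distances spent on insertion, which can then be compared with the cost of whichever pivot selection procedure a full construction uses.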


7.3.2.2 Computed comparisons


As for the computed distances, the computed comparisons for both datasets are greater in the CNT method than in the TD method (Figure 7.4). This reflects the efficiency of the TD method, since it combines the insertion process and the construction process depending on the threshold distance value. For the fifth and the sixth tracking data flows, the lowest number of computed comparisons with the TD method indicates that all objects were inserted in existing GHTs.


7.3.2.3 Computing time 


Figure 7.5 presents the computing time of the TD and CNT methods, for both datasets, as a function of the data flow. For the tracking dataset, the computing time of the TD method is better than that of the CNT method, while for the smart home data the situation is reversed: the computing time of the CNT method is better than that of the TD method, in spite of the good results of the TD method concerning the computed distances (Figure 7.3) and the computed comparisons (Figure 7.4). Two parameters considerably influence the obtained results: the number and the mean height of the indexes.

Figure 7.5: Computing time for CNT method and TD method for each data flow.

Figure 7.6 plots the mean height of the indexes, for both datasets, constructed using the TD method and the CNT method. As can be seen in Figure 7.6-a, the mean height of the GHTs constructed using the TD method is greater than that of the GHTs constructed using the CNT method for both datasets. The mean height of the GHTs constructed using the CNT method is less than that of the GHTs constructed using the TD method because the global number of GHTs constructed using the CNT method is much higher than that using the TD method, as shown in Figure 7.6-b.

Figure 7.6: Mean height (a) and global number of GHT (b), for both used datasets, using the TD and the CNT methods.


7.3.3 Evaluation of parallel kNN search


For the evaluation of the parallel kNN search with k = 5, 10, 15, 20, 50 and 100 in GHTs constructed using our proposed TD method, the distances, comparisons and time of the kNN search are computed over 100 queries. To test the efficiency of the kNN search in GHTs constructed using the TD method, the results are compared with those of the kNN search in GHTs constructed using the CNT method.


7.3.3.1 Distances in parallel kNN search 


Figure 7.7 shows the distances computed during the parallel kNN search with k = 5, 10, 15, 20, 50 and 100 in GHTs constructed using the TD method and the CNT method. For both datasets, the distances computed using the TD method are fewer than those computed using the CNT method. This is because, in the TD method, the objects of a cluster are inserted in the GHT of the closest cluster, while in the CNT method a GHT is constructed for each cluster regardless of how close the clusters are.

Figure 7.7: Computed distances during kNN search in indexes constructed using the TD method and CNT method.


7.3.3.2 Comparisons in parallel kNN search 


As expected, the number of comparisons computed during the kNN search in indexes constructed using the TD method is lower than with the CNT method (Figure 7.8). This reflects the efficiency of the TD method, which allows the insertion of objects under conditions related to the threshold distance. For k > 50, the computed comparisons of the CNT method increase considerably compared with those of the TD method, which proves once again the efficiency of the TD method for indexing continuous data flow.


Figure 7.8: Computed comparisons during kNN search in indexes constructed using the TD method and CNT method.


7.3.3.3 Time of kNN search


Figure 7.9 plots the time of the parallel kNN search with k = 5, 10, 15, 20, 50 and 100 in GHTs constructed using the TD method and the CNT method. The time of kNN search in the TD method GHTs is better than that in the CNT method, which confirms the efficiency of the TD method for processing the continuous data flow. For k < 30, the time of kNN search in the TD method GHTs is less than that of the CNT method by 22% for the tracking dataset and by 45% for the smart home data. This difference changes for k = 100: the time of the kNN search for the TD method is less than that for the CNT method by 32% for the tracking dataset and by 47% for the smart home data. Besides surpassing the CNT method, the TD method proves its efficiency when compared with the BCCF-tree [5] and the IWC-tree [5], in which the whole dataset was indexed in one tree. The TD method is, however, largely surpassed by the B3CF-tree, in which parallelism was used in data indexing and in the kNN similarity query search.

Figure 7.9: Time of kNN search in GHT constructed using the TD method and the CNT method.
Note that, in order to make the comparison with the TD method, the implementation results for the BCCF-tree and the IWC-tree were obtained by running them on our own machine.


7.3.3.4 Comparison of the time of kNN search between CV and TD methods


The comparison of the parallel kNN search in indexes constructed using the CV method and the TD method is presented in Figure 7.10 for the tracking dataset, the only dataset common to the tests of both methods. As can be seen, the CV method surpasses the TD method: for k = 100, the time of the kNN search is 0.029s for the TD method and 0.02s for the CV method. The parallel kNN search time evolves in the same way as a function of the parameter k for both methods. However, the TD method depends strongly on the threshold distance determination method, which presents a serious limitation when indexing continuous IoT data.

Figure 7.10: Time of search in CV and TD method.


7.4 Conclusion 


In order to index the continuous IoT data flow, an efficient method called threshold distance (TD) is proposed in this chapter. This method, developed in the fog-cloud architecture, is a two-step process. In the first step, which takes place at the clustering fog level, the data flow is grouped into clusters by means of the DBSCAN algorithm. In the second step, at the indexing fog level, the data of each cluster are inserted in an existing GHT or a new GHT is constructed, after comparing the distance between the centers of the first clusters and those of the next clusters to a threshold distance value. To check the efficiency of the TD method, it was compared with another method called the creation of a new tree (CNT). The experimental results showed that both methods are efficient compared with two other indexing methods. The experimental results also showed that, even though the TD method surpassed the CNT method during the construction of GHTs and the parallel kNN search, it remains insufficient compared with the CV method in terms of the time of kNN query search.
8 Parallel kNN Search in QCCF-tree Nodes


8.1 Introduction 


In the previous propositions, the enhancement of the efficiency of the kNN query search in indexes, by the introduction of parallelism made possible by the use of the DBSCAN clustering algorithm as a pre-indexing process, was demonstrated. However, what about the use of parallelism inside the indexes themselves? To answer this question, a new index called Quad-tree based on Containers at the Cloud-Fog computing level (QCCF-tree), inspired by the BCCF-tree [5], is proposed. It is constructed in metric space, where the data is divided into four balls with four pivots. Four pivots are chosen to eliminate data overlapping and index degeneration problems. To speed up the kNN query search, parallelism is used inside the QCCF-tree, i.e. in the QCCF-tree nodes. The present chapter starts with a detailed description of the proposed approach, followed by the presentation of the experiments and the evaluation of the construction and the kNN query search results of the proposed QCCF-tree, by comparison with the results of our proposed B3CF-tree and those of some existing indexes, namely the BCCF-tree [5], IWC-tree [5], MX-tree [247] and BB-tree [176], [201].

Figure 8.1: Cloud-fog computing architecture.


8.2 Proposed Approach 


In spite of the efficiency of the BCCF-tree [5] in large-scale data indexing, it presents a high construction time compared with other indexes such as the BB-tree [176] and the MX-tree [247]. In this section, we investigate the characteristics of the cloud-fog architecture for the construction of our index, called Quad-tree based on Containers at the Cloud-Fog computing level (QCCF), intended as an improvement of the BCCF-tree. Our system architecture, similar to that of the BCCF-tree [5], consists of three layers (Figure 8.1): the IoT devices layer (or terminal layer), the fog layer, and the cloud layer. The terminal layer sends the data generated by the interconnected IoT devices to the fog layer. The fog nodes are close to the IoT devices and have the ability to compute and store the data. In this approach, data is indexed and the QCCF-tree is constructed in the fog layer. The leaf nodes of the constructed QCCF-tree are stored in the cloud layer. The QCCF-tree is based on the division of the space, in the fog layer, into four non-overlapping sub-spaces (or partitions). The creation of the four partitions follows a two-step process (Figure 8.2). In the first step, the space is divided into two regions, left and right, by choosing the two farthest objects as left pivot $p_l$ and right pivot $p_r$. In the second step, each region is divided in turn into two partitions, top and bottom. Pivots are always chosen as the farthest objects. This partitioning process results in the creation of four balls with pivots $p_1$, $p_2$, $p_3$ and $p_4$.

We define the QCCF-tree nodes N (Figure 8.2) as follows:


Figure 8.2: Partitioning of space in QCCF-tree. (Panels: Step 1, left region and right region; Step 2, top and bottom partitions.)


• $L$, a leaf node, consists of a subset of the indexed objects, $E_L \subset E$, where $|E_L| \le C_{max}$; the contents of the leaves partition $E$.


• $N$, an internal node, is a 14-tuple:

$$(p_1, p_2, p_3, p_4, r_1, r_2, r_3, r_4, r_{12}, r_{34}, N_1, N_2, N_3, N_4) \in O^4 \times \mathbb{R}^6 \times \mathcal{N}^4$$

where:

$r_{12} = d(p_1, p_2)$ defines two balls $B_1(p_1, r_{12})$ and $B_2(p_2, r_{12})$, centered on $p_1$ and $p_2$ respectively and having a common radius, large enough for the two balls to have a nonempty intersection.

$r_{34} = d(p_3, p_4)$ defines two balls $B_3(p_3, r_{34})$ and $B_4(p_4, r_{34})$, centered on $p_3$ and $p_4$ respectively and having a common radius, large enough for the two balls to have a nonempty intersection.

$(r_1, r_2, r_3, r_4)$ are the distances to the farthest object in the subtree rooted at that node $N$ with respect to $p_1$, $p_2$, $p_3$ and $p_4$ respectively.

$(N_1, N_2, N_3, N_4)$ are four subtrees, such that:

$$N_1 = \{o \in N : d(p_1, o) < d(p_2, o) < d(p_3, o) < d(p_4, o)\}$$
$$N_2 = \{o \in N : d(p_2, o) < d(p_1, o) < d(p_3, o) < d(p_4, o)\}$$
$$N_3 = \{o \in N : d(p_3, o) < d(p_4, o) < d(p_1, o) < d(p_2, o)\}$$
$$N_4 = \{o \in N : d(p_4, o) < d(p_3, o) < d(p_1, o) < d(p_2, o)\}$$


The construction of the QCCF-tree is presented in Algorithm 11.


8.2.1 QCCF-tree build


In the incremental process of the QCCF-tree construction, objects are inserted from top to bottom. The formal description of the QCCF-tree construction process is presented in Algorithm 11. Initially, the tree is empty and is considered as a leaf. The farthest two-pivot search algorithm is applied to all objects to divide the space into two regions (left and right). This algorithm is then applied, first, to the objects in the left region to divide them into two partitions (top and bottom) with pivots $p_1$ and $p_2$ and, second, to the objects in the right region to divide them into two partitions (top and bottom) with pivots $p_3$ and $p_4$. As a result, the container is divided into four non-overlapping subsets so that each element in the container belongs to its nearest pivot. This transforms the leaf into an internal node with four pivots $p_1$, $p_2$, $p_3$ and $p_4$ that create four leaf nodes (Figure 8.2).
Algorithm 11 Construction of QCCF-tree

$$\mathrm{BuildQCCF}(S \in \mathcal{P}(O)) \in \mathcal{N}$$

with:

$(p_1, p_2, p_3, p_4)$: the four farthest pivots

[...]

$$
\left(
\begin{array}{l}
p_1, p_2, p_3, p_4, \\
\mathrm{BuildQCCF}(\{e \in S : d(p_1, e) < d(p_2, e) < d(p_3, e) < d(p_4, e)\} \setminus \{p_1\}), \\
\mathrm{BuildQCCF}(\{e \in S : d(p_2, e) < d(p_1, e) < d(p_3, e) < d(p_4, e)\} \setminus \{p_2\}), \\
\mathrm{BuildQCCF}(\{e \in S : d(p_3, e) < d(p_4, e) < d(p_1, e) < d(p_2, e)\} \setminus \{p_3\}), \\
\mathrm{BuildQCCF}(\{e \in S : d(p_4, e) < d(p_3, e) < d(p_1, e) < d(p_2, e)\} \setminus \{p_4\})
\end{array}
\right)
$$
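
The two-step partitioning described in this section can be sketched in Python as follows. This is a simplified illustration under assumptions, not the thesis implementation: it uses Euclidean distance, finds the farthest pair by brute force (the text does not specify the farthest two-pivot search algorithm), breaks distance ties in favor of the first pivot, and assumes every region holds at least two objects. The function names are hypothetical.

```python
import math
from itertools import combinations


def farthest_pair(points):
    """Brute-force search for the two objects that are farthest apart."""
    return max(combinations(points, 2), key=lambda pair: math.dist(*pair))


def split_two(points):
    """Split points around their two farthest objects, used as pivots."""
    a, b = farthest_pair(points)
    near_a = [x for x in points if math.dist(x, a) <= math.dist(x, b)]
    near_b = [x for x in points if math.dist(x, a) > math.dist(x, b)]
    return (a, near_a), (b, near_b)


def four_way_partition(points):
    """Step 1: left/right regions. Step 2: top/bottom partitions of each region."""
    (_, left), (_, right) = split_two(points)
    (p1, n1), (p2, n2) = split_two(left)
    (p3, n3), (p4, n4) = split_two(right)
    return [(p1, n1), (p2, n2), (p3, n3), (p4, n4)]
```

Each returned pair holds a pivot and the objects assigned to it; the four object lists are disjoint and together cover the input, which mirrors the non-overlapping property stated above.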


8.2.2 Parallel kNN search in QCCF-nodes


Parallelism is used, in this approach, to minimize the time of the similarity search query process. Unlike other indexing trees, parallelism is applied at the level of the QCCF-nodes because of the presence of four pivots in each internal node, which may increase the time of the sequential kNN search process. The formal description of the kNN search in each QCCF-node is presented in Algorithm 12. In each internal QCCF-node, the kNN search is performed in the left region (with pivots $p_1$ and $p_2$) and the right region (with pivots $p_3$ and $p_4$) in parallel. The aim of the k-nearest neighbor search is to find the set $A$ of the $k$ objects closest to a query point $q$. The set $A$ is the union of the set $A_{12}$ (in the left region) and the set $A_{34}$ (in the right region). The kNN search algorithm starts with a query radius $r_q = \min(rq_{12}, rq_{34})$, where $rq_{12}$ is the query radius in the left region and $rq_{34}$ is the query radius in the right region. Both radii are initialized to $+\infty$, which would lead to scanning the whole dataset, and then decrease while traversing each node to the distance to the $k$-th object in the ordered lists $A_{12}$ and $A_{34}$ respectively. Comparing $d(q, p_1)$ and $d(q, p_2)$ with $rq_{12}$, and $d(q, p_3)$ and $d(q, p_4)$ with $rq_{34}$, determines the descent of the query point in the index. The leaf nodes contain a subset of the indexed data with a maximum cardinality $C_{max}$. To find the k nearest neighbors in a leaf, we simply sort the indexed data by increasing distance to the query $q$. As a result of the search, the first $k$ sorted objects in the list $A$ are returned.


Algorithm 12 Parallel kNN search in QCCF-nodes.

$$
\text{kNN-QCCF}\left(N \in \mathcal{N},\; q \in \mathbb{R}^n,\; k,\; d,\; r_q \in \mathbb{R}^+ = +\infty,\; A \in (\mathbb{R}^+ \times O)^{\mathbb{N}} = \emptyset\right) \in (\mathbb{R}^+ \times O)^{\mathbb{N}}
$$

with:

• $A_{12} = \text{kNN-QCCF}(L, q, k, d, rq_{12}, k\text{-insert}(A_{12}, ((d(p_1, q), p_1), (d(p_2, q), p_2))))$

• $A_{34} = \text{kNN-QCCF}(R, q, k, d, rq_{34}, k\text{-insert}(A_{34}, ((d(p_3, q), p_3), (d(p_4, q), p_4))))$

• $rq_{12} = \max\{d : (d, o) \in A_{12}\}$ if $|A_{12}| = k$

• $rq_{34} = \max\{d : (d, o) \in A_{34}\}$ if $|A_{34}| = k$

• $r_q = \min(rq_{12}, rq_{34})$

• $A = A_{12} \cup A_{34}$

$$
= \begin{cases}
A & \text{if } N = \bot \\
A_{12} & \text{if } N = (p, r, L, R) \wedge d(q, p_1) < rq_{12} \wedge d(q, p_2) < rq_{12} \\
\quad \parallel & \\
A_{34} & \text{if } N = (p, r, L, R) \wedge d(q, p_3) < rq_{34} \wedge d(q, p_4) < rq_{34}
\end{cases}
$$
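
The structure of the in-node parallel search can be sketched in Python as follows. This is a deliberately flattened illustration, not the thesis implementation: each region is a plain list of objects instead of a recursive subtree, distances are Euclidean, and the helper names (`k_insert`, `query_radius`, `knn_in_region`, `parallel_knn`) are hypothetical. Note also that threads in CPython do not execute CPU-bound Python code truly in parallel, so the sketch shows the control flow (two concurrent region searches, then a merge of $A_{12}$ and $A_{34}$) rather than the achievable speed-up.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def k_insert(ordered, item, k):
    """Insert a (distance, object) pair, keeping only the k closest, sorted."""
    return heapq.nsmallest(k, ordered + [item])


def query_radius(ordered, k):
    """Distance to the k-th neighbour once k candidates are known, else +inf."""
    return ordered[-1][0] if len(ordered) == k else math.inf


def knn_in_region(region, q, k):
    """Scan one region, skipping objects beyond the current query radius."""
    found = []
    for obj in region:
        d = math.dist(obj, q)
        if d < query_radius(found, k):
            found = k_insert(found, (d, obj), k)
    return found


def parallel_knn(left, right, q, k):
    """Search both regions concurrently, then merge A12 and A34 into A."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f12 = pool.submit(knn_in_region, left, q, k)
        f34 = pool.submit(knn_in_region, right, q, k)
        a12, a34 = f12.result(), f34.result()
    return heapq.nsmallest(k, a12 + a34)
```

The query radius starts at infinity and shrinks to the distance of the k-th candidate as soon as k candidates are known, which is the role played by $rq_{12}$ and $rq_{34}$ in the description above.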


8.3 Simulation and Results 


To perform our experiments, two datasets of different sizes and dimensions are used.


1. Geographical coordinate database: a real dataset of 988 2D vectors. It contains BD-L-TC topographic data of selected locations and places [270].


2. Tracking of a moving object: a real dataset of 62702 20D vectors. It represents the results of a random simulation of tracking a moving object using wireless cameras [5].
The experiments were performed using the Python programming language on an Intel(R) Core(TM) i7-8550U CPU, 1.80 GHz × 8 processor, with 16 GB of RAM, a 256 GB SSD and a 64-bit Linux operating system (Ubuntu). The aim of the experiments is to analyze the effectiveness of the proposed QCCF-tree index construction and query response by comparing our results with those obtained with the following index structures:


• BCCF-tree (Binary tree based on containers at the cloud-fog computing level) [5]: this index is based on recursive space partitioning using the k-means clustering algorithm to efficiently separate the space into two subspaces.


• IWC-tree (Indexing tree without containers) [5]: the comparison of our results with those of this index can show the effectiveness of using containers in binary trees.


• MX-tree [247]: the comparison of our results with those of this index can highlight the difference between hyperplane and ball partitioning in metric space.


• BB-tree (Bubble Buckets tree) [176]: a comparison with our proposed index will show the difference between the metric space structure and the multidimensional space structure.


8.3.1 Evaluation of the QCCF-tree construction 


In this section, the construction of the QCCF-tree is evaluated through the number of computed distances (Figure 8.3), the number of comparisons (Figure 8.4) and the construction time (Figure 8.5). The size of the container in the QCCF-tree is set to $C_{max} = \sqrt{N}$.
8.3.1.1 Number of calculated distances


As can be seen in Figure 8.3, the number of calculated distances varies from one dataset to another. However, for both datasets, the minimum number of distances is calculated for the BB-tree, while the maximum number of distances is calculated for the IWC-tree. This is expected from these two indexes, since the BB-tree is based on multidimensional space partitioning without distance calculation and, in the IWC-tree, the indexing is done for all the objects in the data. For the tracking dataset, the proposed QCCF-tree exhibits a lower number of distances than the BCCF-tree and the IWC-tree. For the geographical coordinates data, the number of calculated distances in the QCCF-tree is close to that of the MX-tree. Partial indexing of data in containers considerably reduces the number of distances when compared with the IWC-tree. For the geographical coordinates data, the number of distances computed in the BCCF-tree is comparable to that computed in the IWC-tree, which may be due to the use of the k-means algorithm for the determination of the two pivots in the BCCF-tree.


Figure 8.3: Number of distances of QCCF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree. (Panels: Tracking Dataset and Geographical Coordinates; y-axis: Number of Distances; x-axis: Index structure.)
8.3.1.2 Number of comparisons


The number of comparisons is presented in Figure 8.4 for the selected indexes. The number of comparisons for our proposed QCCF-tree is the lowest compared with the other indexes, which indicates its efficiency. Indeed, the division of the data into four subsets in the container creates subsets containing the closest objects, which reduces the number of comparisons. Contrary to the number of distances, the BB-tree exhibits a high number of comparisons.

Figure 8.4: Number of comparisons calculated of QCCF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


8.3.1.3 Construction time 


The variation of the construction time of the chosen index structures is presented in Figure 8.5. For the tracking dataset, the construction time of our proposed QCCF-tree is lower than that of the BCCF-tree and the IWC-tree, and is comparable to those of the BB-tree and the MX-tree. For the geographical coordinates data, the construction time of the proposed QCCF-tree is close to that of the BCCF-tree, with a difference of about 0.015 second, and is greater than those of the BB-tree, the MX-tree and the IWC-tree, with a difference of around 0.04 second.

According to the results on the number of distances, the number of comparisons and the small difference in the construction time, the proposed QCCF-tree could be considered a competitive structure for in-fog IoT data indexing.

Figure 8.5: Construction time of QCCF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


8.3.2 Evaluation of the in-node parallel kNN search 


For the evaluation of the in-node parallel kNN search of similarity queries, the number of distances, the number of comparisons and the search time, for both datasets, are taken as the average of 100 queries. The variation of these three characteristics as a function of the parameter k, where k = 5, 10, 15, 20, 50 and 100, is compared with the results of the BCCF-tree, the BB-tree, the MX-tree and the IWC-tree.


8.3.2.1 Number of calculated distances 


For both datasets, the number of distances calculated in the proposed QCCF-tree is less than that of the other structures (Figure 8.6), which reflects the efficiency of using four balls with four pivots for data partitioning and of the parallelism when browsing the index during the similarity query search. For the geographical coordinates data, the number of distances calculated in the QCCF-tree increases from 44 for k = 5 and stabilises at 55 from k = 20. For the tracking dataset, the number of distances calculated for the QCCF-tree increases without stabilizing. However, the ratio between the number of distances for k = 5 and that for k = 100, which is 78%, indicates that the number of distances is nearly invariant as a function of the parameter k.

Figure 8.6: Number of distances calculated for the kNN search in QCCF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


8.3.2.2 Number of calculated comparisons 


Figure 8.7 presents the number of comparisons as a function of the parameter k. As for the number of distances, the QCCF-tree exhibits the lowest number of comparisons compared with the other indexes. This also confirms the efficiency of our proposed index in the similarity query search. For the geographical coordinates, the number of comparisons in the QCCF-tree increases from 84 for k = 5 and stabilises at 139 beyond k = 15. For the tracking dataset, the number of comparisons in the QCCF-tree increases by a magnitude of
10 from k = 5 ($20.239 \times 10^{3}$) to k = 100 ($26.4251 \times 10^{4}$). For the same dataset, the number of comparisons in the QCCF-tree represents 15% of the number of comparisons in the BCCF-tree for k = 5 and 13% for k = 100.
Figure 8.7: Number of comparisons calculated for the kNN search in QCCF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


8.3.2.3 Time of search 


The efficiency of an index can be evaluated from the data retrieval time. The in-node parallel kNN search time in the proposed QCCF-tree is plotted in Figure 8.8, together with the search time of the BCCF-tree, the BB-tree, the MX-tree and the IWC-tree, as a function of the parameter k.

As can be seen in Figure 8.8, the search time in the QCCF-tree has the lowest value compared with the other indexes. This was expected after the evaluation of the number of distances and the number of comparisons, where the efficiency of the in-node parallel search was demonstrated. For the geographical coordinates data, the kNN search time in the QCCF-tree is mainly invariant as a function of the parameter k, with a mean value of 0.0013s, which is less than the search time in the BCCF-tree (0.0025s for k = 5). For the tracking dataset, the kNN search time in the QCCF-tree is also invariant as a function of the parameter k. Its mean time of 0.021s is less than the search time in the BCCF-tree for k = 5, which is equal to 0.1s. Our results are also comparable to those in the literature. In [237], for k = 50, the query time is 0.1s on the Foursquare dataset, while in [209] the execution time for a spatial range query on an R*-tree on Spark is 0.02s. In [263], with 32 workers and for k = 4, the query cost is 2.7s for the CoPHIR dataset.

Figure 8.8: Time of kNN search in QCCF-tree, BCCF-tree, BB-tree, MX-tree and IWC-tree.


8.3.3 Comparison between B3CF-tree and QCCF-tree 


Given the above interesting results, the QCCF-tree could be considered as an improvement of the BCCF-tree, especially when parallelism is combined with the kNN search method. However, a comparison with our proposed B3CF-tree index, in which parallelism is used during index construction and the kNN query search, must be made to determine which index could be considered as an efficient alternative for IoT data indexing, storing and searching.

According to the experimental results above, our proposed metric space approach proved its efficiency regarding the use of parallelism during both the B3CF-tree construction and the kNN query search. However, a comparison with the QCCF-tree must be made in order to find the best alternative for big IoT data indexing and retrieval. The kNN search time, with k = 5, 10, 15, 20, 50 and 100, in the B3CF-tree is presented with the kNN search time in QCCF-tree nodes in Figure 8.9 for the geographical coordinates and tracking datasets. As can be seen, the B3CF-tree presents much better results than the QCCF-tree.

For the construction time as well, Figure 8.10 shows that the results of the B3CF-tree are much better than those of the QCCF-tree for both datasets, which, without a doubt, makes the B3CF-tree the efficient alternative for IoT data indexing and query retrieval.

Figure 8.9: Time of kNN search in QCCF-tree and B3CF-tree.

Figure 8.10: Construction time in QCCF-tree and B3CF-tree.


However, despite its efficiency, the B3CF-tree faces a limitation, namely the query search cost. Indeed, the use of parallelism makes the kNN query search simultaneous in all fogs, which multiplies the consumed energy, considering that the search result will finally be sent from only one index in one fog. The sequential kNN search consumes less energy than the parallel kNN search. However, it presents a latency problem because, if the query is not found in one fog, the kNN search is done in the next fog, and so on.


8.4 Conclusion 


In this chapter, partitioning with four balls and four pivots was used in order
to overcome the problem of efficiently indexing, storing and retrieving big IoT
data. The motivation is that dividing exponentially growing data into subsets
using balls with one or two pivots induces the degeneration of the index, due to
the inherent inadequacy of space partitioning. Our proposed index structure,
called QCCF-tree, exhibited interesting and competitive experimental results both
in the construction and in the similarity query search using parallelism when
browsing the index nodes. Indeed, the comparison of the index construction
evaluation and the similarity search results of the QCCF-tree with those of the
BCCF-tree [5], IWC-tree [5], MX-tree [247] and BB-tree [176],[201] showed that
the QCCF-tree largely surpassed them. However, when these results are compared
with those of the B3CF-tree, the QCCF-tree proved clearly inferior in both index
construction and similarity query search.
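
As a generic illustration of partitioning with four balls and four pivots in a
metric space, the following Python sketch splits a set of objects into four
balls, each defined by a pivot and a covering radius. It is a sketch under
stated assumptions and not the QCCF-tree construction algorithm: the
farthest-first pivot choice, the nearest-pivot assignment rule and the names
choose_pivots and partition_into_balls are all hypothetical.

```python
import math


def choose_pivots(points, n_pivots=4):
    """Farthest-first pivot choice (an assumption, not the thesis's rule)."""
    pivots = [points[0]]
    while len(pivots) < n_pivots:
        # Next pivot: the object farthest from its closest current pivot.
        pivots.append(
            max(points, key=lambda p: min(math.dist(p, q) for q in pivots))
        )
    return pivots


def partition_into_balls(points, pivots):
    """Assign each object to its nearest pivot and track each ball's radius."""
    balls = [{"pivot": q, "radius": 0.0, "objects": []} for q in pivots]
    for p in points:
        dists = [math.dist(p, q) for q in pivots]
        i = dists.index(min(dists))
        balls[i]["objects"].append(p)
        balls[i]["radius"] = max(balls[i]["radius"], dists[i])
    return balls


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 0), (10, 1),
            (0, 10), (1, 10), (10, 10), (9, 10)]
    for ball in partition_into_balls(data, choose_pivots(data)):
        print(ball)
```

Applied recursively to the objects of each ball, a split of this kind yields a
tree with four children per node.
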
This thesis work represents our contribution to similarity query search in
metric space in IoT systems. The kNN method was used for similarity query
search in the proposed indexes, developed in metric space using the fog-cloud
architecture, which contains a terminal layer, a fog layer and a cloud layer. The
indexing process is shifted from the cloud to the fog nodes to keep the data near
the indexing structure and thus significantly reduce network traffic congestion.
Moreover, each fog node creates its own indexing structure, allowing parallelism
not only during tree construction but also in the search process, by launching
the same query simultaneously on all fog nodes. In the first proposed approach,
called the B3CF-tree (Binary tree based on Containers at the Cloud-Clusters Fog
computing level), each fog node is divided into a clustering fog level and an
indexing fog level. In the clustering fog level, IoT data sent from the terminal
layer is partitioned into homogeneous groups, or clusters, in terms of type and
dimension using the DBSCAN algorithm. The aim of the clustering process was to
generate balanced trees with a reduced degree of overlapping between leaves. In
the indexing fog level, objects in each cluster are indexed, in parallel, in
B3CF-trees. The data partitioning, using the DBSCAN algorithm, allowed
parallelism not only when indexing the data of the resulting clusters but also
when using the kNN method for the similarity query search. The experimental
results, obtained using one data stream, showed that the proposed B3CF-tree
outperforms indexes from the literature, such as the BCCF-tree, IWC-tree and
BB-tree, in terms of index construction, index quality and similarity query
search using the kNN method.


To index the continuous data streams generated by IoT devices, two other
approaches were proposed: the Coefficient of Variation (CV) method and the
Threshold Distance (TD) method. In both approaches, the kNN search method was
combined with parallelism when searching for similarity queries.


In the CV method approach, the fog layer is divided into three levels: the
clustering level, the clusters processing level and the indexing level. In the
clustering fog level, the first data stream is grouped into clusters using the
DBSCAN algorithm. These clusters are stored in the clusters processing fog level
while their corresponding BH-trees are directly constructed, in parallel, in the
indexing fog level. After the clustering of the arrival data stream, in the
clustering level, the coefficient of variation (CV) of the union of each arrival
cluster with a copy of the first clusters is calculated and, according to the CV
value, the objects of the arrival cluster are inserted into an existing BH-tree
or a new BH-tree is constructed. To test the efficiency of the proposed CV
method, two other scenarios were proposed for comparison. In the first scenario,
called the Creation of a New Index (CNI) method, a BH-tree is constructed for
each arrival cluster. In the second scenario, called the Insertion in an
Existing Index (IEI) method, the objects of each arrival cluster are inserted in
the existing BH-tree corresponding to the closest existing cluster. In the
evaluation of BH-tree construction, the IEI method surpassed the CV and CNI
methods, and the parameters of the CV method are always located between those of
the CNI and the IEI methods. For the parallel kNN query search, the three
proposed methods were efficient compared with other methods from the literature.
The comparison of the proposed CV method with the proposed scenarios showed that
the CV method is more efficient than the CNI and the IEI methods in terms of the
parallel kNN similarity query search and the energy consumption.
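
The CV-based decision summarised above can be sketched in Python as follows.
This is a minimal sketch under assumptions, not the procedure evaluated in this
thesis: the coefficient of variation (standard deviation divided by mean) is
taken here over the pairwise Euclidean distances of the union, a low CV is read
as "the union is homogeneous, so insert", and the threshold value as well as the
names coefficient_of_variation, place_arrival_cluster and cv_threshold are
hypothetical.

```python
import itertools
import math
import statistics


def coefficient_of_variation(points):
    """CV = standard deviation / mean, computed here over all pairwise
    distances of `points` (an assumed definition for this example)."""
    dists = [math.dist(a, b) for a, b in itertools.combinations(points, 2)]
    mean = statistics.fmean(dists)
    return statistics.pstdev(dists) / mean if mean > 0 else 0.0


def place_arrival_cluster(arrival, first_clusters, cv_threshold):
    """Return the position of the first cluster whose tree should receive the
    arrival cluster, or None when a new tree should be built instead."""
    best_idx, best_cv = None, float("inf")
    for idx, existing in enumerate(first_clusters):
        # Work on a copy of the existing cluster joined with the arrival one.
        cv = coefficient_of_variation(list(existing) + list(arrival))
        if cv < best_cv:
            best_idx, best_cv = idx, cv
    return best_idx if best_cv <= cv_threshold else None


if __name__ == "__main__":
    first = [[(0, 0), (0, 2), (2, 0), (2, 2)],
             [(100, 100), (100, 102), (102, 100), (102, 102)]]
    near = [(1, 0), (1, 2), (0, 1), (2, 1)]
    far = [(50, 50), (50, 52), (52, 50), (52, 52)]
    print(place_arrival_cluster(near, first, cv_threshold=0.5))
    print(place_arrival_cluster(far, first, cv_threshold=0.5))
```

With this definition, merging two well-separated clusters mixes short and long
distances and raises the CV, whereas merging overlapping clusters keeps it low.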


In the TD method approach, the fog layer, as for the B3CF-tree structure, is
divided into a clustering level and an indexing level. In the clustering level,
as for the CV method, the first data stream is grouped into clusters by means of
the DBSCAN algorithm, and the cluster centers are determined. In the indexing
level, Generalised Hyper-plane Trees (GHT) are constructed, in parallel, for each
first cluster. The center of each first cluster is taken as the representative
of the corresponding GHT. After the clustering of the arrival data stream and
the determination of its cluster centers, the objects in each arrival cluster are
either inserted in an existing GHT or indexed in a new GHT, based on the
comparison of the distances between the arrival cluster center and the existing
GHT representatives with a threshold distance. To check the efficiency of the TD
method, it was compared with a proposed scenario called the Creation of a New
Tree (CNT). The experimental results showed that both methods are efficient
compared with other indexing methods from the literature. They also showed that
the TD method surpassed the CNT method not only during the construction of the
GHT but also during the parallel kNN search of similarity queries. However, the
TD method presented some weaknesses when compared with the CV method.
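
The routing rule of the TD method can be illustrated by the short Python sketch
below. It only covers the decision step and rests on assumptions: cluster centers
are computed as coordinate-wise means, the arrival cluster goes to the nearest
representative when that distance does not exceed the threshold, and the center
of a cluster that triggers a new tree becomes a new representative. The names
route_arrival_cluster, representatives and td are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a cluster (assumed notion of cluster center)."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def route_arrival_cluster(arrival, representatives, td):
    """Decide where an arrival cluster goes.

    Returns (tree_position, created): the position of the tree that receives
    the cluster, and whether a new tree had to be created for it.
    `representatives` is updated in place when a new tree is created.
    """
    center = centroid(arrival)
    if representatives:
        idx = min(range(len(representatives)),
                  key=lambda i: math.dist(center, representatives[i]))
        if math.dist(center, representatives[idx]) <= td:
            return idx, False  # insert into the existing tree
    representatives.append(center)  # the new tree is represented by its center
    return len(representatives) - 1, True


if __name__ == "__main__":
    reps = [(0.0, 0.0), (10.0, 10.0)]
    print(route_arrival_cluster([(9, 9), (11, 11)], reps, td=5.0))
    print(route_arrival_cluster([(50, 50), (52, 52)], reps, td=5.0))
    print(reps)
```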


In the last proposition of this work, the fog node of the cloud-fog architecture
was not divided into levels. In the fog node, the proposed QCCF-tree (Quad-tree
based on Containers at the Cloud-Fog computing level) is based on partitioning
with four balls and four pivots in metric space. This approach was proposed in
order to overcome the problem of efficiently indexing, storing and retrieving
big IoT data. The proposed QCCF-tree exhibited interesting and competitive
experimental results both in the index construction and in the parallel kNN
similarity query search inside the QCCF-tree, i.e. in the QCCF-tree nodes. The
comparison of the experimental results with those of the BCCF-tree, BB-tree,
MX-tree and IWC-tree showed that the performance of the proposed QCCF-tree
largely surpasses theirs, both in the index construction and in the similarity
query search. The evaluation and the comparison between the QCCF-tree and the
B3CF-tree clearly showed that the efficiency of the parallel similarity query
search and the quality of the B3CF-tree indexes exceeded those of the QCCF-tree.
Indeed, the parallelism made possible by the DBSCAN clustering improved the
construction characteristics of the B3CF-tree and also significantly accelerated
the kNN similarity query search.


As future work, we will focus on implementing the algorithms in real IoT
networks and testing them on real data from real situations. Despite the
demonstrated effect of parallelism in improving the indexing and retrieval of
big IoT data in terms of time, it has the disadvantage of cost, i.e. energy
consumption. Indeed, the use of parallelism, especially when searching for
similarity queries, requires exploring all machines simultaneously while the
answer is sent from only one machine. An alternative to parallelism that reduces
the energy consumption while preserving the efficiency of the methods proposed
in this thesis will also be considered.